independent research lab
ZBS GG
zbs·gg·no reset. no amnesia.
empathic memory bench v3
recall@3 · n=100 probe suite · 60 events · cosine leads overall (0.420 vs 0.416) · Pulse v3 leads the stateful axis (0.419 vs cosine_state 0.314) · we say both plainly
// primary suite — retrieval baselines
| system | overall R@3 | core | stateful | multi-signal | chain |
|---|---|---|---|---|---|
| cosine | 0.420 | 0.583 | 0.343 | 0.307 | 0.533 |
| Pulse v3 | 0.416 | 0.517 | 0.419 | 0.333 | 0.412 |
| cosine_state | 0.390 | 0.517 | 0.314 | 0.293 | 0.517 |
| hybrid | 0.285 | 0.500 | 0.219 | 0.173 | 0.325 |
| hybrid_state | 0.262 | 0.433 | 0.219 | 0.133 | 0.325 |
| state_concat_only | 0.203 | 0.100 | 0.124 | 0.147 | 0.517 |
| bm25 | 0.156 | 0.350 | 0.086 | 0.080 | 0.179 |
// memory-system adapters, same corpus
| system | overall R@3 | core | stateful | multi-signal | chain |
|---|---|---|---|---|---|
| Pulse v3 (Cohere embed-v4.0) | 0.416 | 0.517 | 0.419 | 0.333 | 0.412 |
| claude-mem | 0.400 | 0.600 | 0.305 | 0.333 | 0.450 |
| LangMem | 0.397 | 0.617 | 0.352 | 0.253 | 0.433 |
| LlamaIndex Memory | 0.397 | 0.617 | 0.352 | 0.253 | 0.433 |
| Pulse v3 (TE3-small, backbone-matched) | 0.375 | 0.467 | 0.390 | 0.280 | 0.375 |
| Mem0 | 0.347 | 0.567 | 0.257 | 0.280 | 0.371 |
| OpenAI Memory (TE3-large) | 0.307 | 0.550 | 0.257 | 0.107 | 0.404 |
| Graphiti (Zep) | 0.120 | 0.433 | 0.038 | 0.040 | 0.050 |
// stateful axis: Pulse v3 (0.419) lead over each system, relative to that system's score
- cosine: +0.076 stateful R@3 · +22% relative
- cosine_state: +0.105 stateful R@3 · +33% relative
- hybrid: +0.200 stateful R@3 · +91% relative
- hybrid_state: +0.200 stateful R@3 · +91% relative
- state_concat_only: +0.295 stateful R@3 · +238% relative
- bm25: +0.333 stateful R@3 · +387% relative
- claude-mem: +0.114 stateful R@3 · +37% relative
- LangMem: +0.067 stateful R@3 · +19% relative
- LlamaIndex Memory: +0.067 stateful R@3 · +19% relative
- Mem0: +0.162 stateful R@3 · +63% relative
- OpenAI Memory (TE3-large): +0.162 stateful R@3 · +63% relative
- Graphiti (Zep): +0.381 stateful R@3 · +1003% relative
the stateful axis is the paper's supported claim: same query, different user state, different ideal episode. on overall R@3 Pulse v3 does not lead — cosine does (+0.004), and cosine also leads core and chain. backbone-matched Pulse (TE3-small) is not the overall adapter winner either — claude-mem, LangMem and LlamaIndex are ahead on overall. what survives every cut is the stateful lead.
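The stateful deltas can be re-derived from the stateful column of the two tables. The sketch below assumes each delta is Pulse v3 (Cohere, 0.419) minus the other system's score, and that the relative figure divides by the other system's score; the helper name `lead_over` is illustrative and not taken from the repository.

```python
# Stateful R@3 values copied from the two tables above.
PULSE_STATEFUL = 0.419

STATEFUL_R3 = {
    "cosine": 0.343,
    "cosine_state": 0.314,
    "hybrid": 0.219,
    "hybrid_state": 0.219,
    "state_concat_only": 0.124,
    "bm25": 0.086,
    "claude-mem": 0.305,
    "LangMem": 0.352,
    "LlamaIndex Memory": 0.352,
    "Mem0": 0.257,
    "OpenAI Memory (TE3-large)": 0.257,
    "Graphiti (Zep)": 0.038,
}


def lead_over(pulse: float, other: float) -> tuple[float, int]:
    """Return (absolute lead, relative lead in whole percent).

    Assumption: the relative lead is measured against the other
    system's score, not against Pulse's.
    """
    absolute = pulse - other
    relative = absolute / other
    return round(absolute, 3), round(100 * relative)


if __name__ == "__main__":
    for name, score in STATEFUL_R3.items():
        absolute, relative_pct = lead_over(PULSE_STATEFUL, score)
        print(f"{name}: +{absolute:.3f} stateful R@3 · +{relative_pct}% relative")
```

Because the denominator is the other system's score, a small baseline such as Graphiti (0.038) produces a very large relative figure from a moderate absolute gap.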
// method
- Corpus: 60 events + 100 probes (original 35 preserved byte-for-byte + 65 new, stratified), behavioural memory tests for AI companions. Four probe types: core, stateful, multi-signal, chain. All 60 events are referenced in at least one ideal_top_3 / ideal_chain.
- Metric: R@3 = |retrieved_top_3 ∩ ideal_top_3| / |ideal_top_3|. Chain probes are scored as unordered membership against ideal_chain from saved top-5 retrievals.
- Backbone: the primary suite uses Cohere embed-v4.0 across all baseline rows, so Pulse has no backbone advantage. Adapter table: OpenAI gpt-4o-mini + text-embedding-3-{small,large}, with a backbone-matched Pulse (TE3-small) row as the fair comparator.
- Fine-tune ablation: an earlier n=35 bge-m3 fine-tune lift (+0.067 stateful) did not survive expansion: at n=100 the fine-tuned backbone is a negative absolute result (stateful 0.378 ± 0.072 across 3 seeds vs Cohere 0.419). What is preserved across backbones is the stateful lead over hybrid_state: +0.200 on Cohere, +0.226 on fine-tuned bge-m3. We report this as disclosed in the paper.
- Scope: single-user deployment regression suite derived from twelve months of real companion use. Not a cross-user benchmark; no claim of broad memory-system superiority. Adapter rows are protocol sanity checks, not native-system evaluations.
- Reproduce: github.com/zbs-gg/emo-bench. Adapters in external-evals/scripts/run_*_on_v3_bench.py, raw retrievals in external-evals/results/. Paper source + PDF: github.com/zbs-gg/pulse-paper.
- Snapshot: primary n=100 run 2026-05-16 (frozen JSON canonical) · adapter snapshot 2026-05-11 · graphiti-core 0.29.0 · mem0 2.0.0 · langmem 0.0.30 · llama-index 0.13 · openai-python 2.32.0 · live-endpoint drift check 2026-06-02 preserved central stateful values (0.419 / 0.314).
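As a minimal sketch of the metric described in the method list, the snippet below computes R@3 as set overlap and scores chain probes as unordered membership over the top-5 retrievals. The function names and the event ids are illustrative, not taken from the repository, and dividing the chain score by |ideal_chain| is an assumption, since only the R@3 denominator is stated explicitly.

```python
from typing import Sequence


def recall_at_3(retrieved: Sequence[str], ideal_top_3: Sequence[str]) -> float:
    """R@3 = |retrieved_top_3 ∩ ideal_top_3| / |ideal_top_3|."""
    ideal = set(ideal_top_3)
    if not ideal:
        raise ValueError("ideal_top_3 must not be empty")
    top_3 = set(retrieved[:3])  # only the first three retrievals count
    return len(top_3 & ideal) / len(ideal)


def chain_score(retrieved_top_5: Sequence[str], ideal_chain: Sequence[str]) -> float:
    """Unordered membership of ideal_chain events in the saved top-5.

    Assumption: the score is normalised by |ideal_chain|.
    """
    ideal = set(ideal_chain)
    if not ideal:
        raise ValueError("ideal_chain must not be empty")
    top_5 = set(retrieved_top_5[:5])  # order inside the top-5 is ignored
    return len(top_5 & ideal) / len(ideal)


if __name__ == "__main__":
    # Hypothetical event ids, for illustration only.
    retrieved = ["e12", "e07", "e33", "e02", "e41"]
    print(recall_at_3(retrieved, ["e07", "e02", "e33"]))  # 2 of 3 ideal events in top-3
    print(chain_score(retrieved, ["e02", "e41", "e50"]))  # 2 of 3 chain events in top-5
```

Under this reading, a per-type column in the tables would be the mean of these per-probe scores over the probes of that type.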