empathic memory bench v3

recall@3 · n=100 probe suite · 60 events · cosine leads overall (0.420 vs 0.416) · Pulse v3 leads the stateful axis (0.419 vs cosine_state 0.314) · we say both plainly

// primary suite — retrieval baselines

system	overall R@3	core	stateful	multi-signal	chain
cosine	0.420	0.583	0.343	0.307	0.533
Pulse v3	0.416	0.517	0.419	0.333	0.412
cosine_state	0.390	0.517	0.314	0.293	0.517
hybrid	0.285	0.500	0.219	0.173	0.325
hybrid_state	0.262	0.433	0.219	0.133	0.325
state_concat_only	0.203	0.100	0.124	0.147	0.517
bm25	0.156	0.350	0.086	0.080	0.179

// memory-system adapters, same corpus

system	overall R@3	core	stateful	multi-signal	chain
Pulse v3 (Cohere embed-v4.0)	0.416	0.517	0.419	0.333	0.412
claude-mem	0.400	0.600	0.305	0.333	0.450
LangMem	0.397	0.617	0.352	0.253	0.433
LlamaIndex Memory	0.397	0.617	0.352	0.253	0.433
Pulse v3 (TE3-small, backbone-matched)	0.375	0.467	0.390	0.280	0.375
Mem0	0.347	0.567	0.257	0.280	0.371
OpenAI Memory (TE3-large)	0.307	0.550	0.257	0.107	0.404
Graphiti (Zep)	0.120	0.433	0.038	0.040	0.050

// stateful axis — delta vs pulse v3

system	Pulse v3 lead (stateful R@3)	relative to system
cosine	+0.076	+22%
cosine_state	+0.105	+33%
hybrid	+0.200	+91%
hybrid_state	+0.200	+91%
state_concat_only	+0.295	+238%
bm25	+0.333	+387%
claude-mem	+0.114	+37%
LangMem	+0.067	+19%
LlamaIndex Memory	+0.067	+19%
Mem0	+0.162	+63%
OpenAI Memory (TE3-large)	+0.162	+63%
Graphiti (Zep)	+0.381	+1003%
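
The deltas above can be re-derived from the stateful columns of the two tables. A minimal sketch, assuming each delta is the Pulse v3 (Cohere) stateful score of 0.419 minus the other system's stateful score, and that the relative figure divides by the other system's score. The names `stateful_delta` and `format_row` are illustrative and are not taken from the repository.

```python
# Stateful R@3 values copied from the primary and adapter tables above.
PULSE_V3_STATEFUL = 0.419

STATEFUL_R3 = {
    "cosine": 0.343,
    "cosine_state": 0.314,
    "hybrid": 0.219,
    "hybrid_state": 0.219,
    "state_concat_only": 0.124,
    "bm25": 0.086,
    "claude-mem": 0.305,
    "LangMem": 0.352,
    "LlamaIndex Memory": 0.352,
    "Mem0": 0.257,
    "OpenAI Memory (TE3-large)": 0.257,
    "Graphiti (Zep)": 0.038,
}


def stateful_delta(system_score, pulse_score=PULSE_V3_STATEFUL):
    """Return (absolute, relative) lead of Pulse v3 over one system.

    Assumption: the relative lead is measured against the other
    system's score, not against Pulse v3's.
    """
    absolute = pulse_score - system_score
    relative = absolute / system_score
    return absolute, relative


def format_row(name, system_score):
    """Format one row as: name, signed absolute delta, signed percent."""
    absolute, relative = stateful_delta(system_score)
    return f"{name}\t{absolute:+.3f}\t{relative:+.0%}"


if __name__ == "__main__":
    for name, score in STATEFUL_R3.items():
        print(format_row(name, score))
```

Under these assumptions the script reproduces every listed delta to the printed precision, which suggests the adapter rows are compared against the Cohere-backed Pulse v3 row rather than the backbone-matched TE3-small row.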

the stateful axis is the paper's supported claim: same query, different user state, different ideal episode. on overall R@3 Pulse v3 does not lead — cosine does (+0.004), and cosine also leads core and chain. backbone-matched Pulse (TE3-small) is not the overall adapter winner either — claude-mem, LangMem and LlamaIndex are ahead on overall. what survives every cut is the stateful lead.

// method

Corpus: 60 events + 100 probes (original 35 preserved byte-for-byte + 65 new, stratified), behavioural memory tests for AI companions. Four probe types: core, stateful, multi-signal, chain. All 60 events referenced in at least one ideal_top_3 / ideal_chain.
Metric: R@3 = |retrieved_top_3 ∩ ideal_top_3| / |ideal_top_3|. Chain probes scored as unordered membership against ideal_chain from saved top-5 retrievals.
Backbone: Primary suite: Cohere embed-v4.0 across all baseline rows, so no backbone advantage for Pulse. Adapter table: OpenAI gpt-4o-mini + text-embedding-3-{small,large}, with a backbone-matched Pulse (TE3-small) row as the fair comparator.
Fine-tune ablation: An earlier n=35 bge-m3 fine-tune lift (+0.067 stateful) did not survive expansion: at n=100 the fine-tuned backbone is a negative absolute result (stateful 0.378 ± 0.072 across 3 seeds vs Cohere 0.419). What is preserved across backbones is the stateful lead over hybrid_state: +0.200 on Cohere, +0.226 on fine-tuned bge-m3. We report this as disclosed in the paper.
Scope: Single-user deployment regression suite derived from twelve months of real companion use. Not a cross-user benchmark; no claim of broad memory-system superiority. Adapter rows are protocol sanity checks, not native-system evaluations.
Reproduce: github.com/zbs-gg/emo-bench. Adapters in external-evals/scripts/run_*_on_v3_bench.py, raw retrievals in external-evals/results/. Paper source + PDF: github.com/zbs-gg/pulse-paper.
Snapshot: primary n=100 run 2026-05-16 (frozen JSON canonical) · adapter snapshot 2026-05-11 · graphiti-core 0.29.0 · mem0 2.0.0 · langmem 0.0.30 · llama-index 0.13 · openai-python 2.32.0 · live-endpoint drift check 2026-06-02 preserved central stateful values (0.419 / 0.314).
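
The R@3 definition in the Metric entry above can be sketched in a few lines. This is an illustrative reading, not the repository's scorer: the function names and event ids are hypothetical, and the chain score is assumed to be normalised by the length of ideal_chain, which the method notes do not state explicitly.

```python
def recall_at_3(retrieved, ideal_top_3):
    """R@3 = |retrieved_top_3 ∩ ideal_top_3| / |ideal_top_3|."""
    ideal = set(ideal_top_3)
    if not ideal:
        raise ValueError("ideal_top_3 must not be empty")
    # Only the first three retrieved ids count toward the overlap.
    return len(set(retrieved[:3]) & ideal) / len(ideal)


def chain_score(retrieved, ideal_chain, k=5):
    """Unordered membership of ideal_chain in the saved top-k retrievals.

    Assumption: the denominator is |ideal_chain|; order within the
    chain is ignored, as the method notes describe.
    """
    ideal = set(ideal_chain)
    if not ideal:
        raise ValueError("ideal_chain must not be empty")
    return len(set(retrieved[:k]) & ideal) / len(ideal)


if __name__ == "__main__":
    # Hypothetical event ids, for illustration only.
    ranked = ["e12", "e03", "e40", "e07", "e09"]
    print(recall_at_3(ranked, ["e03", "e07", "e12"]))
    print(chain_score(ranked, ["e07", "e03", "e12", "e55"]))
```

In this example two of the three ideal events appear in the top 3, so R@3 is 2/3, and three of the four chain events appear in the top 5, so the chain score is 0.75.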
