independent research lab
ZBS GG
zbs·gg · no reset. no amnesia.
empathic memory bench v3
recall@3 · n=35 corpus · 10 systems · open source · 2026-05-15
SOTA: Pulse v3 + bge-m3 fine-tuned (strict zero-shot)
// leaderboard
| system | overall R@3 | core | stateful | multi-signal | chain* |
|---|---|---|---|---|---|
| Pulse v3 (Cohere embed-v4.0, n=100 headline) | 0.416 | 0.517 | 0.419 | 0.333 | 0.412 |
| Pulse v3 (bge-m3 LoRA fine-tuned, mean of 3 seeds) | 0.375 | 0.467 | 0.378 | 0.289 | 0.388 |
| cosine (Cohere, n=100) | 0.420 | 0.583 | 0.343 | 0.307 | 0.533 |
| cosine_state (Cohere, n=100) | 0.390 | 0.517 | 0.314 | 0.293 | 0.517 |
| hybrid (Cohere, n=100) | 0.285 | 0.500 | 0.219 | 0.173 | 0.325 |
| hybrid_state (Cohere, n=100) | 0.262 | 0.433 | 0.219 | 0.133 | 0.325 |
| state_concat_only (Cohere, n=100) | 0.203 | 0.100 | 0.124 | 0.147 | 0.517 |
| bm25 (n=100) | 0.156 | 0.350 | 0.086 | 0.080 | 0.179 |
| Mem0 (text-embedding-3-small, n=35) | 0.171 | 0.333 | 0.200 | 0.233 | 0.000 |
| LangMem (text-embedding-3-small, n=35) | 0.162 | 0.400 | 0.167 | 0.200 | 0.000 |
| LlamaIndex Memory (text-embedding-3-small, n=35) | 0.162 | 0.400 | 0.167 | 0.200 | 0.000 |
| OpenAI Memory (text-embedding-3-large, n=35) | 0.152 | 0.267 | 0.200 | 0.200 | 0.000 |
| Graphiti (Zep) (text-embedding-3-small, n=35) | 0.048 | 0.200 | 0.033 | 0.033 | 0.000 |
| cross-encoder (bge-reranker-v2-m3, state-conditioned, n=35) | 0.105 | 0.333 | 0.133 | 0.067 | 0.000 |
// delta vs pulse v3
- Pulse v3 (bge-m3 LoRA fine-tuned, mean of 3 seeds): +0.041 R@3 · +11% relative
- cosine (Cohere, n=100): -0.004 R@3 · -1% relative
- cosine_state (Cohere, n=100): +0.026 R@3 · +7% relative
- hybrid (Cohere, n=100): +0.131 R@3 · +46% relative
- hybrid_state (Cohere, n=100): +0.154 R@3 · +59% relative
- state_concat_only (Cohere, n=100): +0.213 R@3 · +105% relative
- bm25 (n=100): +0.260 R@3 · +167% relative
- Mem0 (text-embedding-3-small, n=35): +0.245 R@3 · +143% relative
- LangMem (text-embedding-3-small, n=35): +0.254 R@3 · +157% relative
- LlamaIndex Memory (text-embedding-3-small, n=35): +0.254 R@3 · +157% relative
- OpenAI Memory (text-embedding-3-large, n=35): +0.264 R@3 · +174% relative
- Graphiti (Zep) (text-embedding-3-small, n=35): +0.368 R@3 · +767% relative
- cross-encoder (bge-reranker-v2-m3, state-conditioned, n=35): +0.311 R@3 · +296% relative
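The deltas above appear to be the Pulse v3 (Cohere) overall R@3 of 0.416 minus each system's overall R@3, with the relative figure taken against the other system's score. A minimal sketch of that arithmetic, assuming this reading is correct (the function name `delta_vs_pulse` is hypothetical and not taken from the repository):

```python
def delta_vs_pulse(pulse_r3: float, other_r3: float) -> tuple[float, int]:
    """Return (absolute delta, relative delta in whole percent).

    Assumption: the relative gain is measured against the other
    system's score, which matches the figures in the list above.
    """
    absolute = pulse_r3 - other_r3
    relative_pct = round(absolute / other_r3 * 100)
    return round(absolute, 3), relative_pct


PULSE_V3_OVERALL = 0.416  # Pulse v3 (Cohere embed-v4.0) overall R@3 from the leaderboard

# Overall R@3 values copied from the leaderboard table.
for name, score in [("bge-m3 LoRA", 0.375), ("cosine", 0.420), ("Graphiti", 0.048)]:
    abs_delta, rel_pct = delta_vs_pulse(PULSE_V3_OVERALL, score)
    print(f"{name}: {abs_delta:+.3f} R@3 · {rel_pct:+d}% relative")
```

A negative delta, as for cosine, means the baseline scores higher than Pulse v3 on overall R@3.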
// method
- Corpus: 60 events + 35 probes, behavioural memory tests for AI companions. Four probe types: core, stateful, multi-signal, chain.
- Metric: R@3 = |retrieved_top_3 ∩ ideal_top_3| / |ideal_top_3|, averaged across 35 probes.
- Backbone: All five production memory systems run with the same OpenAI gpt-4o-mini + text-embedding-3-{small,large}. No backbone advantage for Pulse.
- *chain footnote: Chain probes (10/35) lack `ideal_top_3_event_ids` in the corpus; they are judge-evaluated on the chain axis in the paper, see §5. Strict R@3 credits 0 to every system on chain uniformly. Pulse v3's chain advantage shows up in the judge-rated table, not this one.
- Reproduce: github.com/zbs-gg/emo-bench. Adapters in `external-evals/scripts/run_*_on_v3_bench.py`, raw retrievals in `external-evals/results/`, leaderboard in `external-evals/results/leaderboard-v3.{md,csv}`.
- Snapshot: 2026-05-11 · graphiti-core 0.29.0 · mem0 2.0.0 · langmem 0.0.30 · llama-index 0.13 · openai-python 2.32.0
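The R@3 definition in the method list can be sketched in a few lines. This is an illustrative reading of the formula, not the benchmark's actual scoring code: the function names and the event-id strings are hypothetical, and scoring a probe with no ideal ids as 0 is an assumption based on the chain footnote.

```python
from typing import Iterable, Sequence


def recall_at_3(retrieved: Sequence[str], ideal_top_3: Iterable[str]) -> float:
    """R@3 = |retrieved_top_3 ∩ ideal_top_3| / |ideal_top_3| for one probe."""
    ideal = set(ideal_top_3)
    if not ideal:
        # Assumption: probes without ideal ids (the chain probes) score 0
        # under strict R@3, as the chain footnote describes.
        return 0.0
    top_3 = set(retrieved[:3])  # only the first three retrieved ids count
    return len(top_3 & ideal) / len(ideal)


def mean_recall_at_3(probes: Sequence[tuple[Sequence[str], Iterable[str]]]) -> float:
    """Average per-probe R@3 over (retrieved, ideal_top_3) pairs."""
    return sum(recall_at_3(r, i) for r, i in probes) / len(probes)


# Hypothetical probes: one with a single hit out of three ideal events,
# one chain-style probe with no ideal ids.
probes = [
    (["e1", "e2", "e3", "e4"], ["e1", "e7", "e9"]),
    (["e5", "e6", "e8"], []),
]
print(mean_recall_at_3(probes))
```

Because the denominator is the size of the ideal set, a system is never rewarded for returning more than three results; anything past position 3 is ignored.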