LongMemEval 500-Question Benchmark

Zero-Cost Inference, World-Class Recall

How we took AI memory accuracy from 46% to 83.8% in 12 hours — surpassing every commercial memory system, with $0 inference cost.

April 6, 2026 — Tokyo Brain Engineering

83.8%
LongMemEval Score — #1 Worldwide
| Rank | System | Score | Inference Cost |
|------|--------|-------|----------------|
| 🥇 | Tokyo Brain | 83.8% | $0 |
| 🥈 | Leading GPT-4o memory system | 81.6% | $$$ |
| 🥉 | Graph-based memory platform | 71.2% | $$ |
| 4 | Full context baseline | 60.2% | $$$$ |
| 5 | Popular open-source memory layer | 49.0% | $ |

The Problem

Every AI agent framework treats context as disposable. Your agent learns something in Slack — it stays in Slack. Your Discord bot has no idea what happened in your IDE. Memory systems exist, but they're either too noisy (storing everything, retrieving garbage) or too expensive (requiring LLM calls at retrieval time).

We asked: Can we build a memory system that retrieves the right information, every time, without burning tokens?

The Journey: 46% to 83.8%

| Hour | Score | Milestone |
|------|-------|-----------|
| 0 | 46% | Baseline: raw semantic search |
| 2 | 60% | Query Expansion + Entity Linking + Fact Extraction |
| 4 | 68% | Time Decay + Dedup + Re-Ranking |
| 6 | 72% | Session Decomposition + Preference Boost |
| 8 | 74% | Temporal Ordering + Matching improvements |
| 10 | 81% | Full 500-question validation |
| 12 | 83.8% | Final optimizations: World #1 |

The 10-Layer Recall Pipeline

No LLM calls. No expensive re-ranking models. Pure retrieval engineering.

Layer 1: Query Expansion
Problem: User asks "boss's name" but memory says "Manager: John"
Solution: Expand each query into 4-6 variants with alias maps
Impact: +10-15% on entity questions
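To make the idea concrete, here is a minimal sketch of alias-map query expansion. The alias table and function names are illustrative, not the actual Tokyo Brain implementation:

```python
# Illustrative alias map: a query term expands to its known synonyms.
ALIASES = {
    "boss": ["manager", "supervisor", "team lead"],
    "flat": ["apartment"],
}

def expand_query(query: str, max_variants: int = 6) -> list[str]:
    """Expand one query into several variants; each is searched separately."""
    variants = [query]
    lowered = query.lower()
    for term, aliases in ALIASES.items():
        if term in lowered:
            for alias in aliases:
                variants.append(lowered.replace(term, alias))
    return variants[:max_variants]

print(expand_query("boss's name"))
```

Each variant is embedded and searched independently, and the result sets are merged, so "boss's name" can hit a memory stored as "Manager: John".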
Layer 2: Entity Linking
Problem: Same person has multiple names across languages
Solution: 30+ bidirectional entity mappings
Impact: Cross-lingual recall jumps dramatically
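A bidirectional mapping can be kept as a flat index where either surface form resolves to the other. The example pairs below are hypothetical; the production table is not published:

```python
# Hypothetical cross-lingual entity pairs (illustrative only).
ENTITY_PAIRS = [
    ("John Smith", "ジョン・スミス"),
    ("Tokyo", "東京"),
]

def build_entity_index(pairs: list[tuple[str, str]]) -> dict[str, str]:
    index = {}
    for a, b in pairs:
        index[a.lower()] = b   # one direction...
        index[b] = a           # ...and back again (bidirectional)
    return index

ENTITY_INDEX = build_entity_index(ENTITY_PAIRS)

def link_entities(query: str) -> list[str]:
    """Return the query plus variants with cross-lingual equivalents swapped in."""
    variants = [query]
    lowered = query.lower()
    for form, alt in ENTITY_INDEX.items():
        if form in lowered:
            variants.append(lowered.replace(form, alt))
    return variants
```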
Layer 3: Fact Extraction
Problem: Answers buried in 2000-char conversation blobs
Solution: Auto-extract factual sentences at store time
Impact: +15-20% precision on single-session questions
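A store-time fact extractor can be as simple as keeping short declarative sentences that contain a copula or a value-bearing verb. This is a deliberate simplification of the idea; the pattern list is an assumption:

```python
import re

# Keep sentences that look like facts: contain a copula or value-bearing verb.
FACT_PATTERN = re.compile(r"\b(is|are|was|named|prefers|works at|lives in)\b", re.I)

def extract_facts(conversation: str, max_len: int = 200) -> list[str]:
    """Pull short factual sentences out of a conversation blob at store time."""
    sentences = re.split(r"(?<=[.!?])\s+", conversation)
    return [s.strip() for s in sentences
            if FACT_PATTERN.search(s) and len(s) <= max_len]

blob = "We chatted for a while. My manager is John. Anyway, how's the weather?"
print(extract_facts(blob))
# → ["My manager is John."]
```

The extracted sentences are embedded alongside the raw conversation, so a short, precise fact can outrank a 2000-character blob.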
Layer 4: Session Decomposition
Problem: One embedding for 10-turn conversation = average of all topics
Solution: Split into per-turn chunks, each with own embedding
Impact: Multi-session reasoning 38% → 85%
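Decomposition itself is straightforward: one chunk, and later one embedding, per turn rather than per session. A sketch with assumed field names:

```python
def decompose_session(session_id: str, turns: list[tuple[str, str]]) -> list[dict]:
    """Split a multi-turn session into per-turn chunks.

    Each chunk is embedded on its own downstream, so a 10-turn session
    yields 10 focused vectors instead of one averaged-out vector.
    """
    return [
        {"id": f"{session_id}:{i}", "role": role, "text": text, "turn": i}
        for i, (role, text) in enumerate(turns)
    ]
```

The `turn` index is preserved so later layers (temporal ordering, re-ranking) can still reason about position within the session.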
Layer 5: Time Decay
Problem: January pricing competes equally with today's
Solution: Distance multipliers by age — newer = higher priority
Impact: Knowledge-update hit 100% in testing
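One way to implement age-based distance multipliers is a small step schedule; the breakpoints and multipliers below are illustrative, not the published values:

```python
from datetime import datetime, timedelta, timezone

def decayed_distance(distance: float, stored_at: datetime, now: datetime) -> float:
    """Inflate the vector distance of older memories so newer facts win ties."""
    age_days = (now - stored_at).days
    if age_days <= 7:
        return distance            # fresh: no penalty
    if age_days <= 30:
        return distance * 1.1
    if age_days <= 90:
        return distance * 1.25
    return distance * 1.5          # stale: strongest penalty
```

Because smaller distance means higher rank, January's pricing no longer competes on equal footing with a fact stored yesterday.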
Layer 6: Deduplication
Problem: Same fact stored 3x wastes result slots
Solution: Post-retrieval dedup with cross-collection awareness
Impact: Cleaner results, fewer wasted slots
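A post-retrieval dedup pass can normalize each hit's text and keep only the best-scoring copy. The normalization here is deliberately naive (lowercase, collapse whitespace); cross-collection awareness reduces to merging all hits into one list first:

```python
def dedup(results: list[dict]) -> list[dict]:
    """Keep the lowest-distance copy of each normalized text, drop the rest."""
    seen: set[str] = set()
    kept = []
    for r in sorted(results, key=lambda r: r["distance"]):
        key = " ".join(r["text"].lower().split())  # naive normalization
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept
```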
Layer 7: Curated Boost
Problem: Verified facts should outrank chat logs
Solution: 0.55x distance for curated answer cards
Impact: High-value memories consistently surface first
Layer 8: Sentence-Level Re-Ranking
Problem: Right document found, but answer is in sentence 7 of 12
Solution: Bigram matching with preference/assistant bonuses
Impact: +5-10% on specific phrase retrieval
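Bigram-overlap re-ranking scores every sentence of a retrieved document against the query; the bonus terms and flat bonus weight below are placeholders for the unpublished preference/assistant bonuses:

```python
def bigrams(text: str) -> set[tuple[str, str]]:
    words = text.lower().split()
    return set(zip(words, words[1:]))

def best_sentence(query: str, sentences: list[str],
                  bonus_terms: tuple[str, ...] = ("prefer", "assistant")) -> str:
    """Pick the sentence with the most query bigrams, plus a content bonus."""
    q = bigrams(query)
    def score(s: str) -> float:
        overlap = len(q & bigrams(s))
        bonus = 0.5 if any(t in s.lower() for t in bonus_terms) else 0.0
        return overlap + bonus
    return max(sentences, key=score)
```

So when the right document is found but the answer sits in sentence 7 of 12, the pipeline can still surface that exact sentence.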
Layer 9: Temporal Ordering
Problem: "What was the first thing?" needs chronological order
Solution: Detect temporal words, boost by date order
Impact: Temporal reasoning reached 89%
Layer 10: Preference Extraction
Problem: "What do I prefer?" scattered across conversations
Solution: Auto-extract preference language into answer cards
Impact: Preference tracking hit 100% — perfect score
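Preference language is regular enough that a pattern-based miner can promote it to answer cards at store time. The verb list here is illustrative:

```python
import re

# Grab "I prefer/like/... X" statements up to the end of the sentence.
PREF_RE = re.compile(r"\bI (?:prefer|like|love|hate|always use)\b[^.!?]*", re.I)

def extract_preferences(text: str) -> list[str]:
    """Mine explicit preference statements out of conversation text."""
    return [m.group(0).strip() for m in PREF_RE.finditer(text)]

print(extract_preferences("I prefer dark mode. Also, I always use tabs over spaces."))
# → ["I prefer dark mode", "I always use tabs over spaces"]
```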

Per-Dimension Results (500 Questions)

| Dimension | Score | Questions |
|-----------|-------|-----------|
| Preference Tracking | 100% | 30/30 |
| Temporal Reasoning | 89% | 118/133 |
| Knowledge Updates | 82% | 64/78 |
| Multi-Session Reasoning | 82% | 109/133 |
| User Info Extraction | 80% | 56/70 |
| Assistant Recall | 75% | 42/56 |

Why This Matters

The current #2 system achieves 81.6% by calling GPT-4o at retrieval time. Powerful — but every recall costs tokens.

Tokyo Brain's entire pipeline runs on BGE-m3 embeddings (local), ChromaDB (in-memory), and Node.js post-processing (CPU only). No LLM calls at retrieval. The cost of recalling a memory is $0.

We also don't store garbage. A well-known open-source competitor's production audit found 97.8% of stored memories were noise. Tokyo Brain's built-in Sanitizer filters at store time. Combined with Fact Extraction and Session Decomposition, we store what matters.

The Theoretical Foundation: Expected Utility

Most RAG systems retrieve memories based on a single signal: semantic similarity. This is fundamentally flawed for complex cognition — it confuses relevance (semantic overlap) with utility (value for the current task).

Tokyo Brain's 10-layer pipeline is, at its core, an implementation of Expected Utility-based context selection — a concept formalized in recent cognitive architecture research (Maio, 2026):

EU(m, q) = α · Relevance + β · Recency + γ · Centrality + δ · Salience − η · Cost

Each layer in our pipeline maps directly to a term in this equation:

| EU Component | Tokyo Brain Layer | What It Does |
|--------------|-------------------|--------------|
| α · Relevance | Query Expansion + Entity Linking | Multi-query semantic search with alias resolution |
| β · Recency | Time Decay | Newer memories get lower distance scores |
| γ · Centrality | Curated Boost | Verified facts and answer cards prioritized |
| δ · Salience | Re-Ranking + Preference Boost | Context-aware scoring based on query type |
| −η · Cost | Dedup + Session Decomposition | Eliminate redundancy, maximize information density |

The key insight: retrieval is not a search problem — it's a resource allocation problem. Given a limited context window, which memories maximize the total expected utility for the current task? Our 10-layer pipeline solves this without any LLM calls, using pure algorithmic optimization.
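The equation above can be turned into a toy scorer directly. The weights and the per-memory signals below are placeholders, not the pipeline's actual values; the point is the shape of the computation:

```python
# Placeholder weights for α, β, γ, δ, η (illustrative, not production values).
WEIGHTS = dict(alpha=1.0, beta=0.3, gamma=0.5, delta=0.4, eta=0.2)

def expected_utility(m: dict) -> float:
    """EU(m, q) = α·Relevance + β·Recency + γ·Centrality + δ·Salience − η·Cost."""
    w = WEIGHTS
    return (w["alpha"] * m["relevance"]      # semantic similarity in [0, 1]
            + w["beta"] * m["recency"]       # 1.0 = stored today, decays with age
            + w["gamma"] * m["centrality"]   # curated/verified boost
            + w["delta"] * m["salience"]     # query-type match bonus
            - w["eta"] * m["cost"])          # redundancy / length penalty

def select_context(memories: list[dict], budget: int) -> list[dict]:
    """Greedy resource allocation: fill the context budget by descending EU."""
    return sorted(memories, key=expected_utility, reverse=True)[:budget]
```

Note how a memory with slightly lower raw relevance can still win the context slot if it is fresher, curated, and cheaper to include, which is exactly the resource-allocation framing.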

What's Next: From Retrieval to Cognition

Today's Tokyo Brain excels at recall: finding the right memory at the right time. But true cognitive continuity requires more than passive retrieval, and our roadmap is focused on closing that gap.

The goal is not just a memory that remembers — but a memory that thinks.

Try It

from tokyo_brain import Brain

brain = Brain(api_key="tb-...")

# Store
brain.store("User prefers dark mode")

# Recall with full 10-layer pipeline
result = brain.recall("UI preferences?")
print(result.memories[0].document)
# → "User prefers dark mode"

Ready to give your AI a memory?

Free tier available. No credit card required.

Get Started Free Join Community