The Problem No One Wanted to Talk About
Here's something that probably sounds familiar. You're working with an AI agent on a complex project — maybe it's a coding task, maybe it's an ongoing workflow — and it's going great. The agent understands your preferences, your stack, your conventions. And then the session ends. Or the context window fills up. And poof. It's like none of it happened. You're starting from scratch.
This isn't a niche edge case. It's one of the most fundamental limitations of AI agents as they exist today. Even as context windows have grown past one million tokens, a natural tension keeps emerging between two bad options: keep everything in context and watch quality degrade, or aggressively prune and risk losing information the agent needs later. That problem even has a name now — context rot.
Cloudflare's answer to this is Agent Memory, a managed service that extracts information from agent conversations and makes it available when it's needed, without filling up the context window.
What Agent Memory Actually Does
Think of it less like a database and more like a really good colleague who pays attention. It gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time.
The service is accessed via a binding from any Cloudflare Worker — or via a REST API for agents running outside of Workers. Memory is organized into profiles, each addressed by name, and each profile supports a handful of clean operations:
- Ingest — the bulk path, typically called when the harness compacts context
- Remember — for the model to store something important on the spot
- Recall — runs the full retrieval pipeline and returns a synthesized answer
- Forget — marks a memory as no longer relevant or true
- List — lets the agent see what's stored
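From a Worker, exercising those five operations might look something like the sketch below. The method names, signatures, and the in-memory mock are illustrative only, not Cloudflare's actual binding API:

```typescript
// Hypothetical shape of a memory profile's five operations. Names and
// signatures are assumptions, not the real Agent Memory binding.
interface MemoryProfile {
  ingest(messages: { role: string; content: string }[]): Promise<void>;
  remember(fact: string): Promise<void>;
  recall(query: string): Promise<string>;
  forget(id: string): Promise<void>;
  list(): Promise<string[]>;
}

// Minimal in-memory mock so the shape can be exercised without the service.
class MockProfile implements MemoryProfile {
  private memories = new Map<string, string>();
  private nextId = 0;

  async ingest(messages: { role: string; content: string }[]): Promise<void> {
    for (const m of messages) this.memories.set(String(this.nextId++), m.content);
  }

  async remember(fact: string): Promise<void> {
    this.memories.set(String(this.nextId++), fact);
  }

  async recall(query: string): Promise<string> {
    // The real recall runs a full retrieval pipeline; the mock just
    // substring-matches against stored memories.
    for (const v of this.memories.values()) {
      if (v.includes(query)) return v;
    }
    return "no relevant memory";
  }

  async forget(id: string): Promise<void> {
    this.memories.delete(id);
  }

  async list(): Promise<string[]> {
    return [...this.memories.values()];
  }
}
```

The interesting design point is the split between Ingest (bulk, harness-driven) and Remember (one fact, model-driven): the agent doesn't have to decide when to persist everything, only when something is worth flagging immediately.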
Agent Memory is a managed service with an opinionated API and retrieval-based architecture. That's a deliberate choice: raw filesystem access makes agents burn tokens on storage strategy instead of the actual task, while tighter ingestion and retrieval pipelines better serve production workloads that need temporal logic, supersession, and reliable instruction-following.
The Ingestion Pipeline: More Rigorous Than You'd Expect
When a conversation arrives for ingestion, it doesn't just get dumped into a vector store. It passes through a multi-stage pipeline that extracts, verifies, classifies, and stores memories.
Deterministic IDs and Parallel Extraction
The first step is generating a content-addressed ID for each message — a SHA-256 hash of session ID, role, and content, truncated to 128 bits. If the same conversation is ingested twice, every message resolves to the same ID, making re-ingestion idempotent.
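A minimal version of that ID scheme, assuming a null-byte separator between the three fields (the actual separator is unspecified):

```typescript
import { createHash } from "node:crypto";

// Content-addressed message ID: SHA-256 over session ID, role, and content,
// truncated to 128 bits (32 hex characters). The "\0" field separator is an
// assumption; the point is that identical input always yields the same ID,
// so re-ingesting a conversation is idempotent.
function messageId(sessionId: string, role: string, content: string): string {
  return createHash("sha256")
    .update(`${sessionId}\0${role}\0${content}`)
    .digest("hex")
    .slice(0, 32); // keep 128 of the 256 bits
}
```

Because the ID is derived from the content rather than assigned at write time, a duplicate ingest resolves to the same keys and can be treated as a no-op upsert.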
The extractor then runs two passes in parallel. A full pass chunks messages at roughly 10K characters with two-message overlap and processes up to four chunks concurrently. Each chunk gets a structured transcript with role labels, relative dates resolved to absolutes — "yesterday" becomes a specific date — and line indices for source provenance.
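A greedy chunker along those lines might look like this. The 10K-character threshold and two-message overlap come from the text; the packing strategy itself is an assumption:

```typescript
type Message = { role: string; content: string };

// Greedy chunker: pack messages until ~maxChars, then start the next chunk
// with the last two messages of the previous one (two-message overlap), so
// extraction never loses the context straddling a chunk boundary.
function chunkMessages(messages: Message[], maxChars = 10_000): Message[][] {
  const chunks: Message[][] = [];
  let current: Message[] = [];
  let size = 0;
  for (const msg of messages) {
    if (size + msg.content.length > maxChars && current.length > 0) {
      chunks.push(current);
      current = current.slice(-2); // carry two messages over for context
      size = current.reduce((n, m) => n + m.content.length, 0);
    }
    current.push(msg);
    size += msg.content.length;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```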
For longer conversations — nine messages or more — a detail pass runs alongside the full pass, specifically targeting concrete values like names, prices, version numbers, and entity attributes that broader extraction tends to miss.
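Resolving relative dates like "yesterday" to absolutes is a deterministic job of regex plus date arithmetic, no LLM required. A sketch covering a few illustrative patterns:

```typescript
// Deterministic relative-date resolution. The patterns here are a small
// illustrative subset, not the service's actual rule set.
function resolveRelativeDates(text: string, now: Date): string {
  const day = 24 * 60 * 60 * 1000;
  const iso = (d: Date) => d.toISOString().slice(0, 10); // YYYY-MM-DD
  return text
    .replace(/\byesterday\b/gi, iso(new Date(now.getTime() - day)))
    .replace(/\btomorrow\b/gi, iso(new Date(now.getTime() + day)))
    .replace(/\btoday\b/gi, iso(now))
    .replace(/\b(\d+) days ago\b/gi, (_, n) =>
      iso(new Date(now.getTime() - Number(n) * day))
    );
}
```

Doing this at ingestion time matters: a memory that says "yesterday" is ambiguous forever, while one stamped with an absolute date stays correct.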
Verification and Classification
Here's where it gets genuinely careful. The verifier runs eight checks covering entity identity, object identity, location context, temporal accuracy, organizational context, completeness, relational context, and whether inferred facts are actually supported by the conversation. Each item is passed, corrected, or dropped accordingly.
Memories are then classified into four types:
- Facts — atomic, stable knowledge like "the project uses GraphQL" or "the user prefers dark mode"
- Events — what happened at a specific time, like a deployment or a decision
- Instructions — how to do something, runbooks, procedures, workflows
- Tasks — what's being worked on right now, ephemeral by design
Facts and instructions are keyed. Each gets a normalized topic key, and when a new memory has the same key as an existing one, the old memory is superseded rather than deleted. This creates a version chain with a forward pointer from the old memory to the new memory. Tasks, meanwhile, are excluded from the vector index to keep it lean, but remain discoverable via full-text search.
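The supersession mechanics can be sketched with a small keyed store. The IDs, the topic-key format, and the storage layout here are all illustrative:

```typescript
type Memory = {
  id: string;
  key: string;           // normalized topic key, e.g. "project.api_style"
  content: string;
  supersededBy?: string; // forward pointer to the memory that replaced this one
};

// Keyed supersession: a new memory with an existing key marks the old one as
// superseded instead of deleting it, preserving a version chain.
class KeyedStore {
  private byId = new Map<string, Memory>();
  private activeByKey = new Map<string, string>(); // key → current memory ID
  private nextId = 0;

  store(key: string, content: string): Memory {
    const mem: Memory = { id: String(this.nextId++), key, content };
    const prevId = this.activeByKey.get(key);
    if (prevId !== undefined) {
      this.byId.get(prevId)!.supersededBy = mem.id; // link old → new
    }
    this.byId.set(mem.id, mem);
    this.activeByKey.set(key, mem.id);
    return mem;
  }

  current(key: string): Memory | undefined {
    const id = this.activeByKey.get(key);
    return id !== undefined ? this.byId.get(id) : undefined;
  }

  // Walk the version chain forward from any memory in it.
  history(fromId: string): Memory[] {
    const chain: Memory[] = [];
    let cur = this.byId.get(fromId);
    while (cur) {
      chain.push(cur);
      cur = cur.supersededBy !== undefined ? this.byId.get(cur.supersededBy) : undefined;
    }
    return chain;
  }
}
```

The forward pointer is the useful part: lookups by key always land on the newest version, but nothing stops an agent from asking how a fact evolved.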
How Retrieval Actually Works
Retrieval is where Agent Memory gets interesting — and where you'd expect most systems to fall short. During development, Cloudflare discovered that no single retrieval method works best for all queries, so they run several methods in parallel and fuse the results.
Five Retrieval Channels Running at Once
The first stage runs query analysis and embedding concurrently. The query analyzer produces ranked topic keys, full-text search terms with synonyms, and a HyDE (Hypothetical Document Embedding) — a declarative statement phrased as if it were the answer to the question.
Then five retrieval channels fire in parallel:
- Full-text search with Porter stemming — handles keyword precision
- Exact fact-key lookup — direct match to known topic keys
- Raw message search — a safety net for verbatim details the extraction pipeline may have generalized away
- Direct vector search — semantically similar memories
- HyDE vector search — finds memories similar to what the answer would look like, which often surfaces results that direct embedding misses — particularly for abstract or multi-hop queries where the question and the answer use different vocabulary
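Running the five channels concurrently means retrieval latency is bounded by the slowest channel rather than the sum of all five. A sketch with hypothetical channel stubs:

```typescript
type RankedList = string[]; // memory IDs, best first

// Fan out one query to every retrieval channel at once and collect the
// per-channel ranked lists. Channel implementations are stand-ins here.
async function retrieveAll(
  channels: Record<string, (q: string) => Promise<RankedList>>,
  query: string
): Promise<Record<string, RankedList>> {
  const names = Object.keys(channels);
  // Fire every channel at once; await them together.
  const results = await Promise.all(names.map((name) => channels[name](query)));
  const merged: Record<string, RankedList> = {};
  names.forEach((name, i) => { merged[name] = results[i]; });
  return merged;
}
```

Each channel returns its own ranked list; turning five lists into one answer is the fusion step described next.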
Reciprocal Rank Fusion to Merge It All
Results from all five channels are merged using Reciprocal Rank Fusion (RRF), where each result receives a weighted score based on where it ranked within a given channel. Fact-key matches get the highest weight because an exact topic match is the strongest signal, and ties are broken by recency. A synthesis model then generates a natural-language answer. The one exception is temporal computation: date math is handled deterministically via regex and arithmetic rather than by the LLM, because models are unreliable at it.
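Weighted RRF itself is compact: each memory scores the sum over channels of weight / (k + rank). The per-channel weights and the constant k = 60 below are illustrative, since the text only specifies that fact-key matches weigh most and ties break by recency:

```typescript
type Ranked = { channel: string; ids: string[] }; // IDs best-first per channel

// Weighted Reciprocal Rank Fusion: score(id) = Σ weight[channel] / (k + rank),
// with rank 1-based within each channel. Weights and k = 60 are assumptions.
function fuse(
  lists: Ranked[],
  weights: Record<string, number>,
  recency: Record<string, number>, // higher = more recent; breaks ties
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const { channel, ids } of lists) {
    const w = weights[channel] ?? 1;
    ids.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + w / (k + i + 1));
    });
  }
  return [...scores.keys()].sort((a, b) => {
    const d = scores.get(b)! - scores.get(a)!;
    return d !== 0 ? d : (recency[b] ?? 0) - (recency[a] ?? 0);
  });
}
```

RRF's appeal here is that it only consumes ranks, not raw scores, so channels as different as BM25-style full-text and cosine similarity can be fused without normalizing anything.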
Built on Cloudflare's Own Infrastructure
Under the hood, Agent Memory is a Cloudflare Worker that coordinates several systems: a Durable Object that stores raw messages and classified memories, Vectorize for vector search, and Workers AI for running the LLMs and embedding models.
Each memory context maps to its own Durable Object instance and Vectorize index, keeping data fully isolated between contexts. The DO handles full-text search indexing, supersession chains, and transactional writes. Memory content lives in SQLite-backed DOs, vectors live in Vectorize, and exports will go to R2 for cost-efficient long-term storage.
Model Selection: Bigger Isn't Always Better
One of the more honest findings from development: a bigger, more powerful model isn't always better. The current default is Llama 4 Scout for extraction, verification, classification, and query analysis, and Nemotron 3 for synthesis. Scout handles structured classification tasks efficiently, while Nemotron's larger reasoning capacity improves the quality of natural-language answers. The synthesizer is the only stage where throwing more parameters at the problem consistently helped.
What You Can Actually Build With It
The use cases here are broader than you might initially think.
Individual Agent Memory
Whether you're building with coding agents like Claude Code or OpenCode with a human in the loop, using self-hosted agent frameworks to act on your behalf, or wiring up managed services like Anthropic's Managed Agents, Agent Memory can serve as the persistent memory layer without any changes to the agent's core loop.
Shared Memory Across Teams
This one is genuinely compelling. A memory profile doesn't have to belong to a single agent. A team of engineers can share a memory profile so that knowledge learned by one person's coding agent is available to everyone: coding conventions, architectural decisions, tribal knowledge that currently lives in people's heads or gets lost when context is pruned. What your agents learn stops being ephemeral and starts becoming a durable team asset.
How Cloudflare Uses It Internally
Cloudflare runs Agent Memory for its own workflows. For code review, arguably the most useful thing it learned to do was stay quiet. The reviewer now remembers when a particular comment wasn't relevant in a past review, or when a specific pattern was flagged and the author chose to keep it for a good reason. Reviews get less noisy over time, not just smarter.
They also use it for an internal chat bot that ingests message history, keeps watching channels, and remembers new messages as they arrive, so when someone asks a question, the bot can answer from previous conversations.
Your Data Stays Yours
Worth noting explicitly, because it matters. As agents become more capable and more deeply embedded in business processes, the memory they accumulate becomes genuinely valuable — not just as operational state, but as institutional knowledge that took real work to build.
Cloudflare's position is straightforward: every memory is exportable, and they're committed to making sure the knowledge your agents accumulate on Cloudflare can leave with you if your needs change. The right way to earn long-term trust is to make leaving easy and to keep building something good enough that you don't want to.
Agent Memory vs. AI Search: Not the Same Thing
It's worth drawing a clear line here. While search is a component of memory, agent search and agent memory solve distinct problems. AI Search is the primitive for finding results across unstructured and structured files; Agent Memory is for context recall. The data in Agent Memory doesn't exist as files — it's derived from sessions. An agent can use both, and they're designed to work together.

