AI intel digest
Hierarchical Memory: Context Management in Agents — Sally-Ann Delucia, Arize
Executive summary
Sally-Ann Delucia from Arize AI presents a year of lessons from building Alyx, an AI agent that analyzes its own trace data. The core problem: agents analyzing observability data create a vicious loop in which trace spans grow until context limits are hit, causing failures. The talk covers why naive truncation breaks reasoning, why summarization is unreliable, and how Arize solved the problem with head/tail preservation plus retrievable memory stores. Additional topics include long-session evaluation strategies, sub-agents for offloading heavy data tasks, and findings from the Claude Code source release.
Sources mentioned
- Andrej Karpathy: referenced, via an X post, for the idea of context engineering over prompt engineering
- Arize: the observability platform Alyx is built on; the speaker is Head of Product there
- Alyx: Arize's AI agent for building AI applications, with 40+ skills including prompt optimization, data generation, augmentation, and annotations
- Claude Code: Anthropic's coding agent; its released source was found to use truncation/compression strategies similar to Arize's
- Claude/Cursor: mentioned as examples of applications where users keep everything in one chat vs. starting new ones
Verdict
This video is worth watching for anyone building or operating AI agents. The unique signal is empirical validation of what does and does not work for context management at scale, from a team that faced an unusually severe version of the problem (an agent analyzing its own trace data). Most talks recommend summarization; this one reports that it failed in practice. Most discussions treat sub-agents as a capability architecture; this one frames them primarily as a context budget tool. The confirmation that Claude Code uses a similar head/tail truncation strategy adds weight to the approach. The specific evaluation methodology (test turn 11 after loading 10) is immediately actionable. For practitioners, this is field-tested guidance with concrete numbers and failure modes, not theoretical advice.
Count: 10 facts, 0 assumptions, 0 demonstrations. Signal density: 85.
Signal points
1. Naive truncation (keep beginning, drop rest) breaks reasoning chains; follow-ups become disconnected new conversations
2. Summarization seems obvious but is unreliable because it cedes control over what's important to the LLM itself
3. Head/tail preservation with retrievable middle storage is the working solution Arize has run for months without modification
4. The self-referential problem: agents analyzing their own trace data create compounding context growth that standard approaches cannot handle
5. Long-session bugs appear late and are caught too late without systematic evaluation; Arize tests turn 11 after loading 10 turns
6. Sub-agents are primarily a context management tool, not just a capability distribution mechanism; they isolate heavy data from the main conversation flow
7. Claude Code's source release confirmed similar truncation/compression strategies, suggesting convergence on this approach across major agent implementations
8. Conversation lengths are growing significantly (10 to 20+ turns), making long-term memory across sessions a pressing unsolved problem
Key ideas
Context engineering has superseded prompt engineering as the critical factor in agent success
Why: Early focus was on prompts, but teams realized context quality determines whether agents fail or succeed; speaker references Andrej Karpathy's statement about "plus one in context engineering over prompt engineering"
Implication: AI teams should reallocate engineering effort from prompt optimization to context selection and management strategies
Context management is a product/UX problem, not purely an engineering one
Why: Bad context leads to bad answers, which kills product adoption regardless of technical sophistication; speaker states "if an agent doesn't have the right data, it doesn't have the right context, it's going to give bad answers. And if you give bad answers, nobody's going to want to use your product"
Implication: Context strategies need product-level design thinking about what users actually need the agent to remember
The "vicious loop" of self-referential agent data
Why: Alyx analyzes trace data, which creates spans, which grow the context, which causes failures, which creates more spans; "the system analyzing the data was constrained by the data"
Implication: Agents operating on their own or similar observability data face compounding context growth that standard approaches cannot handle
Separating context (what the model sees now) from memory (what survives for retrieval) is a fundamental architectural split
Why: Smart truncation keeps head/tail in context while storing the middle in memory; "context decides what the model sees, memory decides what survives"
Implication: Future agent architectures should treat these as distinct systems with different optimization goals
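A minimal sketch of that split, assuming a simple in-process store. The talk gives the head/tail budget (100 characters each) but not the implementation, so the names `MemoryStore` and `smart_truncate` and the placeholder format are illustrative:

```python
import uuid

class MemoryStore:
    """Illustrative store for context that is truncated out but must survive."""
    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    def put(self, text: str) -> str:
        key = uuid.uuid4().hex
        self._items[key] = text
        return key

    def get(self, key: str) -> str:
        return self._items[key]

def smart_truncate(text: str, store: MemoryStore,
                   head: int = 100, tail: int = 100) -> str:
    """Keep the head and tail in context; park the middle in memory."""
    if len(text) <= head + tail:
        return text                      # small enough to stay in context whole
    key = store.put(text[head:-tail])    # memory decides what survives
    # The placeholder tells the agent how to pull the middle back on demand.
    return f"{text[:head]} [...elided; retrieve with memory:{key}...] {text[-tail:]}"
```

Unlike the naive version (keep the first 100 characters, drop the rest), nothing is lost here: the placeholder key keeps the middle retrievable.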
Not all context belongs in the same agent — sub-agents enable context isolation
Why: Heavy data search operations with hundreds of spans were overloading the main conversation; offloading to sub-agents kept main context light while preserving access to results
Implication: Multi-agent architectures are not just for capability distribution but for context budget management
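A sketch of that isolation under assumed names (`Agent`, `search_spans`); the point is only that raw spans land in the sub-agent's context while the main conversation receives a compact finding:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Illustrative agent: just a context budget for this sketch."""
    context: list[str] = field(default_factory=list)

def search_spans(sub: Agent, query: str, spans: list[str]) -> str:
    """Sub-agent task: the heavy span data stays in the sub-agent's context."""
    sub.context.extend(spans)                       # hundreds of spans land here
    matches = [s for s in spans if query in s]      # stand-in for real analysis
    return f"{len(matches)} spans matched {query!r}"  # only this travels back

main, sub = Agent(), Agent()
spans = [f"span {i}: status={'error' if i % 9 == 0 else 'ok'}" for i in range(500)]
main.context.append(search_spans(sub, "error", spans))  # one line, not 500 spans
print(len(main.context), "item(s) in main context;", len(sub.context), "in sub-agent")
```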
Long-session evaluation is a necessary signal for context management quality
Why: Bugs from context degradation appear late in conversations and are often only caught by user reports; systematic testing of turn 11 after loading 10 turns makes these bugs detectable
Implication: Teams need eval suites that specifically test multi-turn context retention, not just single-turn accuracy
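A sketch of that evaluation pattern; the load-10-test-11 shape is from the talk, while the harness names and the substring check are assumptions:

```python
from typing import Callable

Turn = dict[str, str]  # e.g. {"role": "user", "content": "..."}

def eval_turn_11(agent: Callable[[list[Turn], str], str],
                 saved_session: list[Turn], probe: str, expected: str) -> bool:
    """Replay 10 recorded turns, then probe on turn 11 for context retention."""
    history = saved_session[:10]       # a realistic long session, loaded verbatim
    answer = agent(history, probe)     # turn 11: did the earlier context survive?
    return expected.lower() in answer.lower()  # crude check; real evals score richer

# Usage: a context-degradation bug now fails a test run instead of waiting
# for a user report, e.g. (my_agent and recorded_turns are hypothetical):
#   assert eval_turn_11(my_agent, recorded_turns,
#                       "which dataset were we analyzing?", "checkout-traces")
```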
Key facts
Arize built an AI agent called Alyx that operates on trace and span data from their observability platform
Confidence: HIGH. Evidence: "Alex is built on top of Arise, which is our observability platform. So we have to deal with all of the traces that come with AI agents"
Naive truncation (keeping only the first 100 characters and dropping the rest) caused the agent to forget everything, making follow-ups look like new conversations
Confidence: HIGH. Evidence: "we started off just taking the first 100 characters and then we just dropped the rest... the agent ultimately just forgot everything. Uh, follow-ups looked like new conversations"
Summarization as a context management strategy was inconsistent and unreliable because it gave the LLM too much control over what was important
Confidence: HIGH. Evidence: "summarization... was too inconsistent. There was no control over what was important. You know, we're just leaving it to the LLM to look at the data, decide what to do with it"
Arize's current solution is "smart truncation" keeping the head (first 100 characters) and tail (last 100 characters), storing the middle in a retrievable memory store
Confidence: HIGH. Evidence: "we take the beginning still 100 characters. We also take a 100 off of the tail... and then we take out the middle and we basically store that"
The agent can retrieve stored context when needed, such as important tool calls or previous messages
Confidence: HIGH. Evidence: "if Alex feels like there's a tool call that was important or a message from the previous conversation that's important, it can always go back and grab that context"
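Building on the memory-store sketch above, that retrieval could be exposed to the agent as a tool; this schema and the `retrieve_context` name are hypothetical, not Arize's API:

```python
# Hypothetical tool definition: the agent sees "memory:<key>" placeholders in
# truncated messages and can call this when an elided tool call or message matters.
RETRIEVE_CONTEXT_TOOL = {
    "name": "retrieve_context",
    "description": "Fetch a message or tool call that was elided from context.",
    "parameters": {
        "type": "object",
        "properties": {
            "key": {"type": "string",
                    "description": "The memory:<key> id from a placeholder"},
        },
        "required": ["key"],
    },
}

def retrieve_context(store: "MemoryStore", key: str) -> str:
    """Resolve a placeholder key back to the stored middle of the message."""
    return store.get(key.removeprefix("memory:"))
```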
Long session evaluations involve loading 10 turns and testing the 11th to catch context degradation bugs before users report them
Confidence: HIGH. Evidence: "we load 10 turns and then we test the 11 to understand how the context is doing. And so these bugs really become testable"
Arize offloaded heavy data search tasks to sub-agents, keeping the main conversation light with only chat history and light context
Confidence: HIGH. Evidence: "we have the main conversation with the chat and light context only... it can delegate to the sub agents... that's where the heavy data stays"
The Claude Code source release revealed they use a similar truncation and compression strategy to Arize's approach
Confidence: HIGH. Evidence: "the claude code code was kind of released... We were surprised that they're using kind of a similar truncation and compression strategy as we are"
Average conversation length with Alyx grew from less than 10 turns to 20+ turns as users started doing longer workflows across the application
Confidence: HIGH. Evidence: "when we started, I was seeing like less than 10 turns per conversation with Alex and now I'm seeing folks really go, you know, push the limits up to like 20 plus"
Arize does not currently have true long-term memory across chat sessions; users starting new chats lose previous context
Confidence: HIGH. Evidence: "we don't really have long-term memory... if they do decide to start a new chat, Alex really doesn't have context for that"
Quotes
“The best context strategy is one that lets your agents remember what it needs to and forget what it doesn't.” — Sally-Ann Delucia
“Context engineering is really choosing strategically what the model sees. It's really important that you think about what the data is that is most important and not just think about, oh, I only have x amount of tokens.” — Sally-Ann Delucia
“The system analyzing the data was constrained by the data and that was a major problem for us.” — Sally-Ann Delucia
“Context decides what the model sees, memory decides what survives.” — Sally-Ann Delucia
“I think I was surprised the most by the fact that summarization didn't work. I think that was again like the obvious choice for us.” — Sally-Ann Delucia
“Agents don't fail because of prompts, they fail because of context.” — Sally-Ann Delucia
“We were surprised that they're using kind of a similar truncation and compression strategy as we are. Uh, and we were kind of hoping to get a little bit of a secret from from them.” — Sally-Ann Delucia on Claude Code