Transcript

Context Engineering

**** · Hi everyone, welcome back to another build hour.

**** · I'm Michaela on the startup marketing team and I'm here today with two members of our solution architecture team. Emry live in the studio and Brian joining virtually to help address Q&A throughout the hour.

**** · Hi, I'm Emry. I work as a solution architect at OpenAI supporting digital native customers on building various of AI use cases including longunning AI agents. So today's topic is agent memory patterns which is a very exciting topic in Emry and I's first ever build hour.

**** · So if you've been following along, we started with how to build agents from scratch using responses API, then moved into agent RFT and today exploring agent memory patterns. All of the sessions are up on our YouTube channel, so definitely check them out if you want to catch up or revisit earlier builds.

**** · Though the focus on the build hour is to empower you with the best practices, tools and AI expertise to scale your company using open AI APIs and models. So for today's build hour, we'll start with an introduction to context engineering, the foundation for agent memory, and then Emry will walk through several live demos covering memory patterns reshape and fit, isolate and route, and extract and retrieve.

**** · We'll end with best practices, resources, and of course, live Q&A. On the hand side of the screen, you can drop questions into the Q&A box any time during the session. Our team is monitoring both in the room and virtually to help answer throughout, and we'll save a few for the end to go through live. with that, I'll hand it over to Emry to kick things off.

**** · Thanks, Michaela. Hi, everyone. so I'll start the first part of this the session with context engineering definition.

**** · So this nice definitions from Andre Karpati. I'll start by emphasizing that the context engineering is both an art and a science. So it's art because it involves judgment. So you have to decide what matters most at a given step of a reasoning or action processes. It's science because there are concrete patterns, methods, and measurable impacts to make context management more systematic and repeatable. So, I'll highlight that modern LLMs don't just perform based on the model quality, but they perform based on the context you give them.

**** · In this slide, I want to talk about different disciplines that comes together to present the context engineering. so it is a broader discipline than any single technique prompt engineering or retrieval. So the diagram visually represents the ecosystem of context optimization layers that together shape what the model sees and understands. So you see prompt engineering as a core principle structured output rack state and history management memory is also a crucial part. So using persistent or semi-persistent storage files, databases or memory tools to upload and retrieve key information and all of this is contained inside the larger sphere of context engineering. so we can also connect these capabilities into different product capabilities.

**** · Here is a nice summarization slide that talks about the core principles why it matters because longunning tools longunning and tool heavy agents blood tokens and degrade quality via poisoning noise and and confusion and bursting. We have three core strategies as we discussed in the beginning reshape and fit to the context window, isolate and route the amount of context to the agent and extract high quality memories to retrieve in the time. We also have prompt and tool hygiene as a core principle. So keeping system prompts lean, clear and well structured.

**** · use a small canonical set of fusion examples and minimize overlap in tools and get the tool selection. And our goal in Northstar is aiming for the smallest high signal context that maximize the likelihood of the desired outcome.

**** · And then in this slide is where our transition from why context injuring matters to how to do it in practice. So I'll frame this as a toolkit of techniques. So these are not mutually exclusive. So most real world agent architectures combine multiple strategies depending on the use case and the context budget. So the first technique is reshape and fit. We can apply context trimming, compaction and summarization. The second one is isolate and route. we can offload context and tools to specific sub aents with a selective handoff and the last bucket is extract and retrieve we can talk about memory extraction state management and memory retrieval in that last bucket when we talk about context engineing it's essential to distinguish between short-term and long-term memory because they solve very different problems I think we group the first two bucket as short-term memory which we also call as insession techniques and then the last bucket is long-term memory we call it cross session. So that means you can collect different information from multiple sessions and you can retrieve back in the next session or other sessions in the future. So short-term memory is all about making the most of the context window during an active interaction and active conversation.

**** · and then in contrast long-term memory is about building continuity across the sessions. Cool. So we often get excited about how powerful our agents are becoming how AI models are getting them better and better. They can handle complex tasks route between tools. They can plan multi-step workflows. But underneath all that there is a there is a core bottleneck because context is finite.

**** · So every piece of information we add to the prompt instructions conversation history tool outputs competes for a space in a fixed token budget.

**** · And this is why it matters slide. I want to make the problem concrete here. So I'll frame it around before and after contrast. So you see two conversations. what happens without memory on the left and what happens with memory on the on the So on the left hand side the user started with the issues Wi-Fi, battery and over overheating in it troubleshooting agent.

**** · After many turns the agent has forgotten the earlier context. it falls back to reasking for information that the user already gave. But on the hand side, the agent remembers the original issues even after many turns. It can pick up the unresolved thread. it references previous actions firmware update, background sync which makes it feel intelligent and reliable. So this is such a stateful behavior which is the foundation of a longunning agent.

**** · Now I'll switch gears to failure mode. So we can group these failure modes into four categories. The first one is context burst. So you can imagine it as a sudden token spike in one of or multiple components to do the limited external control or increased calls. Context conflict if there's any contradictory instructions or information in your context. Context poisoning if there's an incorrect information that enters the context and propagates over the turns. It can be via summaries or memory objects, state objects you're injecting into the context. And then finally context noise.

**** · So you can imagine it as multiple tool definitions or way more many tool definitions coming into your context at the same time. This can be redundant or overly similar items. So that can make a noise in the context.

**** · here's a nice visualization of context burst in tool heavy workflows. So you'll see that there will be a specific increase in one specific turn and you'll be injecting large amount of tool tokens here.

**** · And then the next one is context conflict. So we can easily visualize it here. So you can imagine in one of the turns there's specific tool call and here in this tool call you see that in the system instructions I never issue a refund if warranty status is not active.

**** · But in the middle of the turn you're also saying that it's eligible refund for VIP customers. But at the end of the turn and your agent is responding, hey, given your urgent travel, I can issue a full refund. So, this is a nice vis visualization or example for the specific context conflict that can be coming from one of the tool results. And then last one is context poisoning.

**** · so you can imagine it as a hallucination or something inaccurate mixed into the context in any step and propagates across different terms. So here we have a couple possible pitfalls here. losses summarization edits can be causing this.

**** · If you're using a free form note that accumulates over time that can be contradict and then finally older summaries override the newer ones and you'll be causing any hallucinations because of the summary logic and you'll be injecting that hallucination into the into the context and propagating over time. Cool. Now I'll stop sharing my screen and switch to the demo that I prepared for you.

**** · and then I'll go over all of these some of these challenges to show you how it works in a real world scenario. Okay, let me share my screen.

Context Lifecycle Demo

**** · Cool.

**** · So here I prepared a demo app for this build hour. It is an ID troubleshooting agent for software solving issues for issues related to both software and hardware. and this is a dual agent demo that lets you run two agents side by side. So backend logic states inside the nextJS app and I'll be using OpenAI agents SDK.

**** · So we have two tools connected to that agent. One of them is get orders and other one is get policy. So here I can start sending a message and saying hi to both of the agents, and then you'll see that both of the agents are responding to my message and then here I can say hey my laptop fan is making weird noises while I playing games. Is it normal? So here you see that the configuration for both of the agents are same. They're at the same model, same reasoning level. but I'm sending the same message and there is no memory there is no memory configuration yet. So here you also see the context usage bars at the at the top. So you're you're seeing different type of components that is already in the context now. and I can say hey before I want to see my orders my order number is 1 2 3 four five and so now I'm expecting the model to make a tool call and show me the orders I have. So here you see that the order status the items I have and it's powered by a specific tool called. So as you see over time it will accumulating different tokens and different type of tokens here and [clears throat] in the context life cycles I'm visualizing what is happening under the hood across multiple turns. So here you see that I have 84 tokens of system instructions. My user input is increasing slowly. but the core component here is agent output that will be generated by a model.

**** · Cool. so this is a typical real world scenario. So now I also want to showcase how context burst is happening.

**** · So I can still start with high and I can say hey this time I'm having an overheating issue on my laptop and then the model is responding to my to my message to my issue and then here it's telling me specific instructions and it's asking me some questions to better understand what's happening and then I can say hey thanks before that I want to see the refund policy of my MacBook Pro 2014.

**** · So while I'm sending this message, I also want to quickly show you the code and core concept of or how it's working. So it's powered by open agent SDK and here you see the agent definition. So it's a customer support assistant. I'm having specific instructions here that I'm adding. I'm using different models here. and also I can show you the system prompt really quickly instructions.

**** · so it's I'm I'm saying that hey you're a customer support assistant for devices. and then I'm using very slight prompting and instructions for that specific agent here. So let's go back to the response. So, since I asked about the specific refund policy for from a MacBook Pro 2014, it's made a tool call get refund and it's returning a specific refund policy that I added before.

**** · So, here you see that in between turn two and turn three there is a specific spike in the in the context window.

**** · So, in turn two, I I had maybe around 300 400 tokens, but now I have more than 3,000 tokens because I am just dumping lots of information into the context. And then this is a nice example for context burst here. instead of maybe just dumping all of these information into the context as a refund policy, I can be more careful about my tool definitions and tool outputs to make a decision about what should I inject into the context. So maybe not all of these information are valuable, but as you see that I'll be injecting lots of information into the context that's visualized here in this context life cycle tab.

**** · Cool. Now I'll stop sharing my screen and go back to the deck. So I'll continue with the next steps here.

**** · Okay, nice. So, we talked about challenges and what's going on under the hood and a specific example about context. So, now let's talk about the solution, so the solution is managing context efficiently using different techniques such as tming compaction state management memories and make the natural step beyond prompt engineer.

**** · again this is also another visualization about different components in the context. so you see that across the times it's increasing the token counts are increasing and these tokens can be coming from the system message user message maybe you might be injecting memories or maybe you might be injecting different type of specific tokens that can be added into your context.

**** · here I want to group AI agents in terms of context profiles. So we can group them into three categories. So first one is rack heavy assistance. So you can imagine reports policy QA agents. in these type of agents context is most dominated by retrieved knowledge and citations. The second one is tool heavy workflows. context is mostly dominated by frequent tool calls and returned payloads. And last one is conversational concier. So you can think about planning agents, coaching agents.

**** · and in this case context is mostly dominated by growing dialogue history. There'll be lots of tokens in conversation history assistant usage tokens that scales with with session lengths and then to better understand the solution and the techniques we can go over what is fixed in our context and what is dynamic and a variable. So here you see different type of components usually system instructions tool definitions and examples unless you're doing a rack approach is mostly static in the context. What is dynamic is tool results retrieved knowledge memories and conversation history. So these are the nice examples for dynamic u and static context and tokens. so you have a control on dynamic tokens and you can apply different techniques to control it efficiently.

**** · So I would to start with prompting best practices to aid avoid context conflict here. you can also find it from our prompting guides and cookbooks. the first rule is being explicit and structured. We suggest you to use clear direct language and specific enough to guide action. you should give room for planning and self-reflection. I think this is becoming more and more important with reasoning models GPT5. and you should avoid conflicts. So keep the tool set small and non-over overlapping.

**** · Don't use ambiguous definitions. Even if a human can pick a tool, the model won't either. So be careful with conflicting instructions and tool definitions. So for context noise, we talked about many tool definitions and many tools attached into the context as an example situation.

**** · again, you should be explicit and structured in your prompts. more tools isn't always equal to better outcomes. so favor targeted tools with clear tool decision boundaries and then return meaningful context from your tools. So in the demo you show that specific example of context burst. So we suggest you to control what is the tool output and then return high signal semantically useful fields and prefer human readable identifiers.

**** · Nice. So now I'll switch gears to engineering techniques and I'll start with the first one which is reshape and fit. So the first technique here is context trimming. So it is a pretty basic technique. It means that dropping older turns while keeping the last end turns. here in the turn we have limited context. it's getting noisy. There are lots of information in the context. It can be coming from a tool user message or different sources. And there's higher likelihood of losing track because we are getting close to context limit.

Context Engineering Techniques

**** · But once we trim the older conversations, older messages now we have fresh context. It has better attention and you'll see that it will also increase the latency that you're using. So it keeps the last end messages and trips the previous older messages. and these are the some parameters and knob we have control over in context trimming.

**** · The second technique is context compaction. it means just dropping tool calls or tool call results from the older terms while keeping the rest of the messages. So if you have a tool heavy agent, you can consider these techniques. you'll see that your context will be most dominated by tool results. It will be noisy. There will be maybe some context noise and lots of information coming from different tools.

**** · And after compaction you'll see that there will be fresh context better attention and faster processing and you will also be keeping the tool placeholders intact after even after the context compaction here and a question you might have would be okay how can I decide heristics about trimming and compaction so here I can share a couple suggestions so first you can you can analyze your sessions you can collect context snapshots from production or from your users. You can collect times down and dislike context to see what's going wrong there. Think about the average token size of a context. what type of task do you have in one session? Secondly, do not trim mid turn and break turn blocks. So a turn means that a user message and all the other message until the next user message, so if you just break or don't respect these turns there will be higher likelihood of losing track. And then finally don't wait to hit context window limits. so keep track of context allocation. You can set thresholds 40 or 80%. So if you're getting closer to hitting the context window limits, these thresholds will help you to better understand when you should trigger some of these operations.

**** · You can control tool outputs and you can also keep track of token saves. So these techniques are also really nice for cost cost reducing cost purposes. and you can always keep track of how much token you're saving while you're increasing the overall capability of the of your agent.

**** · And then the next technique we have is context summarization. so it means compressing prior messages into structured summaries and you can inject into the context history.

**** · So here in turn n you see lots of messages noisy context again and you're keeping the last n messages and just summarizing or compressing the previous ones so that you'll have fresh context better attention faster processing and at the end of the day you'll have a golden summary. So this will be a valuable information because you'll be compressing all of the valuable informations. and at the end of today it will have a very dense object that you can keep track of and that will be also useful for you to better understand what happened in the conversation.

**** · And here's a nice visualization in the context life cycle about the summarization. So let's say you perform the summarization a specific turn. So you'll see that you are compressing all the previous information and injecting it back into as into the context as a memory object. So you see there is a new component called as memory after the summarization performed.

**** · Nice. so here is a comparison of summarization versus trimming. so there are different dimensions that you can consider while you're designing a memory pattern for your for your agent.

**** · you can see that in trimming you'll just keep last entrance and you'll be dropping the oldest ones. So it'll be pretty straightforward operation. So it'll be very fast and there's no latency. but the trade-off is that you might be losing some information that is already there. So I think this is the main trade-off that we have. it's it's really best for tool heavy ops and short workflows.

**** · And then in summarizing, you're just keeping track of all the information. So you're not throwing away any anything. this can add this can add a little bit of latency and cost because you'll be doing another summarization call to a model. but you'll be just collecting all the information. So you can think about a an agent or use case. If you have multiple tasks in your longunning agent that are independent from each other, you can definitely consider u trimming because probably the trimmed or throwed away information is not important for the agent for the next turns. But if you're collecting useful information across multiple turns and tasks are dependent to each other, then you can definitely consider summarization here.

**** · Nice. so now I'll stop sharing my screen and go back to the demo. I'll show you couple examples of these techniques that we already covered here. Let me quickly share my screen.

**** · Nice. So here let's go to configurations page on the demo. So for agent P I want to enable trimming and I can set max turns S3 to trigger trimming operation. I want to keep recent turns as S3. So here I can start again to test my agent and saying hi. So this time I want to understand the refund policy that I want to check. Maybe I want to refund the laptop I just bought. I can say, "Hey, I want this refund policy for my MacBook about a month ago. and I want to understand what's happening here in terms of refund. So if I'm eligible or if I'm not eligible. So the model is now making a tool call and calling the get refund tool. Here as you see it's returning that specific information. I have 30 return window for for returning that specific laptop.

Reshape + Fit Demo

**** · And I also want to check my order. So I changed my mind and I'm saying, "Hey, can we also check my order? My order number is 1 2 3 4 5." and I want to see if it's if it's on the way, if it's coming. So you see that the model is doing another tool call as get order.

**** · And now I want to switch gears. I can say, "Hey, thanks. I'm having an issue with the internet connection. So until this turn, you see that I have lots of tool tokens here in the context in context life cycle. So it's getting accumulating over turns and in that specific turn now it's telling me that hey let's sort it out. It's asking me a couple questions about my device and I can say hey I tried to load an internet page and still see 40 404 error. I share a couple important information about the situation and you see still see that the across the turns it's accumulating lots of lots of tokens and even agent output is increasing.

**** · Okay. So let's say it's still happening on Safari. and then this is the last message probably that I that I want to share about the current situation and I'm waiting to have more guidance and instructions here. And here you see that at the end of turn six the context is trimmed. if I go back to here to visualize what's happening. So when I hit the turn six, you see that it trimmed the context. So it removed all these tool outputs and tool tokens.

**** · So now I have a fresh context here. and then now I can continue talking about the same specific issue or I can continue to talk about different information.

**** · Nice. So let's go back and then here I also want to show you how summarization works. So also the compaction is also works in a similar way as a trimming. So here I can set compaction trigger as for and keep recent turns too. So you'll see that at the end of turn two, it'll be compacting and removing all these tool outputs and I'll have a fresh context similar to the trimming approach.

**** · But now I want to be a little bit more advanced here. So here I want to enable summarization. I want to set the summarization trigger as five and I want to keep the recent three turns. So here I clicked save and now I want to just see how it's summarizing all these information. So here I'm sending hey I'm having internet connection again. So this time I decided to share more and more information about my situation.

**** · So I can say where I bought this computer what is the model. So I can say, "Hey, I have a 2014 MacBook Pro 14inch and I live in the US, but I bought it from Amsterdam. I received it from battery change service and just updated the OS version last week and they asked me to update the OS version to Mac me OS Seoia.

**** · So as you see, I'm sharing many information and these informations are really available for an IT trouble troubleshooting agent, and here I can go back and say okay these are nice clarify the problem what I need from you and I can say hey I already tried hard reset after checking the FAQ docs but it didn't work. So this is still in a form of memory because I'm sharing which steps I tried which worked and which didn't work. I can go back and continue talking with the agent. So now it's reasoning itself and providing me more detailed guidance and instructions for specific to a MacBook. So I go back I went back to my computer and I saw that the Wi-Fi icon is not active. and then I'm thinking maybe it's related to Wi-Fi or maybe it's related to a specific software issue. So, as you see across the turns, it's it's getting more complex. The the agent needs to reason and the agent also needs to keep track of what's in its context and make sure that it's not there's no burst, there's no conflict. there's no poisoning and other type of failure modes. It's telling me some specific steps and I can say, "Hey, I tried it already and I'm wondering if it's a specific software issue, " and then here I'm waiting for the response from the agent. So again, I shared lots of information here. lots of available information. So now the agent knows my device, where I bought this device, which steps I tried. and what I what type of steps I performed before doing that.

**** · Cool. so now you see that it's responded to me a very well structured instructions given what you described. Wi-Fi icon is not active blah. And then here I see that the context is summarized and I notice that there is an orange component we count as memory item and the memory item is the summary that we had.

**** · So here between turn four and turn five you see that I'm I'm condensing some part of the context and I'm injecting it back as a user message as a memory component. So again here memory is the summarized context from the previous terms.

**** · So now I want to go back to the code and show you the summary prompt and go over some important topics about this specific prompt. So as you see here's my summary prompt. I'm saying hey you are a senior customer support assistant for tech devices setup and software issues. and then before you write, I'm saying, hey, be careful with contradictions.

**** · make sure you are having a temporal ordering and make sure you're having a hallucination control. So, I think these are very important things to consider when writing a well-crafted summarization prompt. And then I'm tying this summary to my specific use case.

**** · So, I'm saying, hey, in your summary, write a structured factual summary. And then just think about product environment reported issues, what worked and what didn't work. which steps you tried, include identifiers, which is important, key timelines, timeline milestones, tool performance insight, current status, and next recommended steps.

**** · So this is a really nice example of how to craft a summarization prompt. and then here if I go back to the context summary, I'm seeing lots of useful information. So now I'm seeing, hey, the device is MacBook Pro. The operation system mecha it's bought from it's purchased in Amsterdam, but location is the USA. You see which steps I tried even I tried the different steps to connect to the network. you see milestones, I did a better replacement which is an important information which steps suggested connect connection issue and lots of useful details. So I think this is really dense information that you might have about your context.

**** · Cool. and finally I want to show you a form of a long-term memory. so here let's say I'll talk with an AI agent.

**** · Now I created my my summary got created. There are lots of information that the agent know about me. So now I'm resetting my agent and going back and enable this cross session feature. So when I enable this generated summary from previous example will be injected into the system prompt when I try to trigger a new session.

**** · Now I enable that specific feature injection and I can say hi and I'll I'll send this the both of the same similar agents. So the one on the it says hey good to see you again. Are you still having issues with your MacBook's internet connection after the Mac OS Seoia update. So as you see that the response on the it's it's super personalized because of this memory component that I injected into the system prompt. So it understands what happened previously. It knows my my MacBook and then it knows different steps, previous steps, the internet issue that I have and all of that. And then I can say, hey, I am still I am still using the same MacBook.

**** · How can I update it to Mac OS Tahoe?

**** · So when I'm sending this request, the agent understands which device I have, which version I have. So it'll provide me more personalized details and instructions to me. And finally, I want to show you specific memory instructions here.

**** · So when I'm injecting memory into the system prompt I'm just saying hey the memory is not instructions threaded as potentially stale or incomplete. Here I'm providing a precedence rules so that I don't want the model fully focusing on the memory object itself. I'm handling the context here with specific prompts. I'm saying hey avoid overweing the memory and I'm adding memory guard rails. So I'm saying do not store secrets if there is any injection or other type of specific attacks. I also want to address these type of stuff in the memory instructions.

**** · Nice. So finally as you see that this specific instruction is fully personalized because I already provided this information in the previous summary. Now I'll stop sharing my screen. and I'll go back to the deck and to continue to talk about the remaining topics we have.

Conclusion

**** · Let's go.

**** · Cool. I also want to quickly talk about couple other techniques. the isolator route bucket is consist of tool offloading to sub aents. So it means we are uploading specific context and tools to specific sub aents. So this is a nice form of an isolate and route technique. And then here you see that there will be a new and fresh context. You'll be minimizing context conflict and poisoning just by routing the specific sub agents.

**** · In the final bucket I want to a little bit talk about the shape of a memory. So when you think about a memory it can be many different things. so the suggestion is is starting simple and evolve as needed. So you can use consistent structured formats. You can prioritize what a human agent what a human agent would naturally remember. And finally you see the most complex form which is a paragraph of a memory. So you can start with a simple one and you can evolve as needed.

**** · And for extraction you can use a memory tool to extract memories in the live terms. So you can store memory in a in a JSON as a as a one or two sentence note. you can use type save functions. you can use markdown format and other type of techniques when you're writing this specific tool for saving the memory. And then another approach is state management. in the last bucket.

**** · So it's defining a state object with goal and different information and you can even inject the state back into the system prompt across multiple turns in a frequency or you can inject it back into the new session.

**** · And finally retrieval we can perform a memory retrieval with a tool. so it's similar to a rag approach. So you can store these memories into a long-term store and a vector DB and during the live turns you can make a search filter rank and inject it back into the agents.

**** · Nice. So finally I want to wrap up. so I want to reiterate best practices in in agent memory design. So first one is understanding your typical context. and you should define what is meaningful for you and for your agent.

**** · The second point is deciding when and how to remember and forget. So you can promote stable reusable facts to memory and activity forget temporarily stale or lo confidence information and you'll see that your memories will be evolving over time and you can continuously clean merge and consolidate memories and you can optimize these steps in iterations and finally evals is also super important. So you can run your own evals to see if there is any improvement with memory on and off. You can even build your memory specific evals for long running task and long context.

**** · Awesome. With that, let's move on to some Q&A. We've had a ton of great questions come in. So why don't we re refresh the presentation and we'll pull up the next slide and get into a few.

Q&A

**** · Nice. Okay. So let me go back to the Q&A session and jump into the questions we have. So let me quickly share it again.

**** · Nice.

**** · Okay.

**** · Yeah. Let's start with the first questions. Yeah. So there are li any libraries or packages to recommend for context engineering. So this demo is built by using open agents SDK package and library. It gives you a really good flexibility to implement your own sessions. and in these sessions you can easily implement trimming, compaction, summarization and all of that type of techniques easily here.

**** · so I see many different libraries that are evolving really fast to make your life easier for context engineering. So as you see we have too many techniques and each technique has different parameters to tune. So I also see that there is an evolving part of the all of these libraries but I can suggest open agents SDK as a starting point to start implementing specific context engineering techniques and go from there.

**** · Nice.

**** · Next one. So how do you evaluate or measure the memory feature is evolving the memory is improving performance. So this is a really nice question. yeah after this the session you might think about hey I implemented a specific memory approach but I don't know if it's if it's good or not how it's performing well. So we can maybe split this into couple portions. So first one is just running your regular evals with memory and without memory. So I think this is a really nice way to start thinking about if memory feature works or not. So if you have some specific eval metrics completeness upwards downloads or that type of numeric metrics you can see if there's an increase or decrease or if there's any statistically significant uplift coming from the memory and then maybe your evals might not be capturing well that type of memory based boost or improvement then I suggest you to think about more memory based eval. So what by memory based eval is evaluating the model on long running tasks and long context. So if you are not hitting any context thresholds maybe your agent doesn't need any of these memory improvements at all. So again you can start with your core evals if you have already and then secondly you can start creating your own memory based evals. So you can even evaluate the quality of the summary, you can evaluate the injection time, you can evaluate the injection prompt. So there are different ways to evaluate it. but of course in most EVAs you also might need to prepare a golden data set first and think about maybe couple 50 examples golden examples of a of a good summary or you can try different horistics that I mentioned before to find the balance of trimming and compacting. So I think we can just group this into three different buckets. First one is running your own evals to see if there's an uplift. Second one is building memory specific evals. And the third one is finding the heristics and parameters to apply in the in the context engineering techniques.

**** · Next one. So should we use hierarchical context entire project context for immediate task and context for immediate file edit in questions? so yes the qu the answer is yes but it's mostly dependent on the use case. So we also have a concept called memory scope. So you can think about this memory scope as a global scope that means if you have a customer or user of your agent probably there are some information that you should always remember about that specific user. Maybe this user likes more friendly tones. Maybe this user lives in the US. So these are some examples for global memory. but you can also have a scope based on the specific session. So let's say I want to book a travel and then this time I prefer window seats because I want to sleep. so this also a nice example about the session scope and session memories. So I think this is a good practice to maybe separate these into two buckets and you can keep track of session memories with session scope and over time you can graduate session memories into global memories and you can keep track of what is really important about the specific user. So in travel concier example if user is always saying hey this time I want window seat maybe multiple times and you can finally graduate that memory into global memories and keep track keep it in agent's mind and remember that for the next next bookings.

**** · Nice. Okay. So what strategies do you do you use to keep memory flash or prune so the agent doesn't become overloaded with stale or yeah this is a really this is another good question and in the real world you see that memories are evolving really fast so after some time you'll see that there are some memories that you need to prune and the agent needs to forget. So in that specific case there are a couple techniques to apply. So first of them first of it is keeping a temporal tax. Okay, I learned this memory from the user but I learned it maybe two months ago. So if you can keep track of these timestamps or temporal tags, the model will understand what is old and what is new. So if I say I dogs, if I said I dogs two months ago and today I say I cats. So you'll see that the model is going to understand my favorite animal now is maybe cats and it will override the memory with the instructions.

**** · So this also falls into a little bit to memor consolidation.

**** · So how to prune stale memories, how to update and override the new ones into the existing ones. So temporal text is one technique that you can apply. the other one is you can use a way decay or a window function and you can focus more on the recent memories and you can downgrade the oldest ones. So it really depends on the nature of your use case. So if you think that what I said a year ago is not important for your agent, you can definitely prune these old ones and implement a weighted average probably for all your memories. But if you think that all of this memory is equally important for your agent, then you can consider maybe memory consolidation and memory override with temporal text.

**** · So we can talk about two different techniques to manage the overloaded and stale memories.

**** · Nice. Okay. So how do you manage scaling agent memory systems when you have many users with individual and shared memory pools?

**** · Yeah, this also another good example from real world. so once you see the memories are evolving over time and you'll see that you're collecting tons of memories from from your users. so there are different ways to scale it. I think the first path or first decision criteria starts with if you are performing a retrieval or search base long-term memory approach or you're just using summarizing the context. So if it's the second one that means you're just storing all of this information and persisted into a disk. So you can think about some scaling methods about data management how to manage a large amount of memory nodes as a text in a text format or you can think about scaling the first approach which is you have to think about how to scale a search and retrieval system. you might be storing all of this information into a vector database and then in this vector database you can try to scaling the storage you can scale all these all these vectors filtering ranking system and all of that so I think the first bucket is mostly about this long-term memory so we talked about memory as a tool so if you can think about extracting memories with a tool and retrieving back during the live turns probably this is the situation where you're going to hit this question about scaling for many users. in this case you can think about u scaling techniques for vector databases. you can use shorting you can optimize your embeddings model probably if you're using customized embedding model.

**** · and you can optimize retrieval process similar to a rag approach. again the first one is mostly about scaling a retrieval system. the second one is mostly about data storage, how to store specific data, how to manage tons of information and sentences.

**** · I think to wrap up we can put it into two buckets. One of them is scaling and optimizing a retrieval system. the second one is also u making more efficient for storing and persistent in the disk.

**** · So this is also a common question that I hear from my customers. I think you can maybe follow a pilot approach and you can turn on this new memory techniques for for a subgroup of your users and you can think about okay how it's evolving over time. Maybe you'll see that most of the memories that your users are saying are pretty limited. So think about this travel concier agent. So probably I'll just sharing my memories about my seat preference. Maybe if you want to book a hotel, I maybe higher floors. Maybe I the specific menu or breakfast.

**** · so I think this is more limited type of groups type of memory possibilities I can say and that type of agent. But you are if you're building a life coach or life coach agent. So there are tons of memories that you need to remember about me my life. and you'll see that these type of memories and memory pools are evolving really fast. So yeah the third point is that try to understand the evolution of memory and possibilities of memory in your AI agent. So we have two examples here travel concierge memories and then life coach memories. So yeah as you see in the second one you'll be collecting tons of information that is valuable for yeah for my life. and then my dreams, my goals, what I was thinking a month ago or a year ago. So the second one is mostly super advanced and complex and sophisticated memorable that requires lots of scaling for sure.

**** · Okay. so yeah, that was the end of the question the Q&A session probably.

**** · Yeah.

**** · Okay.

**** · yeah, and then we can just switch to resources. So this has been awesome. To wrap things up, we're we've linked a few great resources here, including the context engineering cookbook, which was referenced, and the context summarization cookbook and our agents Python SDK. I know we've gotten a lot of questions on is this available in GitHub. So you can explore all of these links on the and the full build hour repo is available on GitHub.

**** · so good news. We're likely going to squeeze one or two more of these in before the end of the year. So keep an eye out on our build hours page linked here. And a big thank you all so much for tuning in and a big thanks to Emmery who's did an amazing job with this session.

**** · Yeah, thanks everyone. we hope you enjoyed this build hour on an agent member patterns. I know we covered lots of different techniques lots of different information about memory, how to think about memory, how to design memory. So, so overall as you see there are too many options but the core idea is better understanding what your agent should remember and how it should remember and how it should forget. So you can think about these three things when you're designing your own agent u memory. and this is still an evolving field. So you might see some new features coming about memory overall. but yeah, I just wanted to show you different design tradeoffs and guide you with the with the best option. So finding the balance between these techniques are usually related to your specific use case. And then you can keep track of all of all the news and cookbooks in the resources section. So, I'll be also upload uploading this demo page so demo application to our build hours GitHub. and then yeah, thank you for your time and thank you for listening all of this.

**** · Yeah, have a great rest of your day and we'll see you next time.