Stanford Online - Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG

Transcript

**** · Hi everyone, welcome to another lecture for CS230 deep learning. Today we're going to talk about enhancing large language model applications and I call this lecture beyond LLM.

**** · It has a lot of newer content, and the idea behind this lecture is this: we started to learn about neurons, then we learned about layers, then we learned about deep neural networks, then we learned a little bit about how to structure projects in C3, and now we're going one level beyond, into what it would look like if you were building agentic AI systems at work, in a startup, in a company. It's probably one of the more practical lectures. Again, the goal is not to build a product end to end in the next hour or so, but rather to tell you all the techniques that AI engineers have cracked, figured out, or are exploring, so that after the class you have a breadth of view of different prompting techniques, different agentic workflows, multi-agent systems, and evals, and when you want to dive deeper, you have the background to dive deeper and learn faster about it. Okay, let's try to make it as interactive as possible, as usual. When we look at the agenda, it's going to start with the core idea behind challenges and opportunities for augmenting LLMs. So, we start from a base model. How do we maximize the performance of that base model? Then we'll dive deep into the first line of optimization, which is prompting methods, and we'll see a variety of them. Then we'll go slightly deeper: if we were to get our hands under the hood and do some fine-tuning, what would it look like? I'm not a fan of fine-tuning, and I talk a lot about that, but I'll explain why. I try to avoid fine-tuning as much as possible.

**** · And then we'll do a section four on retrieval-augmented generation, or RAG, which you've probably heard of in the news. Maybe some of you have played with RAGs. We're going to unpack what a RAG is and how it works, and then the different methods within RAG, and then we'll talk about agentic AI workflows. I'll define it. Andrew Ng is one of the first, call it, to have named this trend agentic AI workflows, so we'll look at the definition that Andrew gives to agentic workflows and then we'll start seeing examples. Section six is very practical. It's a case study where we will think about an agentic workflow, I'll ask you to measure if the agent works, and we'll brainstorm how we can measure if an agentic workflow is working the way you want it to work. There are plenty of methods, called evals, that solve that problem. Then we'll look briefly at multi-agent workflows, and then we can have an open-ended discussion where I'll share some thoughts on what's next in AI. I'm looking forward to hearing from you all as well on that one. Okay, so let's get started with the problem of augmenting LLMs. So, an open-ended question for you. You are all familiar with pre-trained models like GPT-3.5 Turbo or GPT-4o.

**** · What's the limitation of using just a base model? What are the typical issues that might arise as you're using a vanilla pre-trained model?

**** · Yes.

**** · Lacks some domain knowledge. You're perfectly right. We had a group of students a few years ago, it was not LLM related, but they were building an autonomous farming device or vehicle that had a camera underneath taking pictures of crops to determine if the crop is sick or not, if it should be thrown away, if it should be used or not.

**** · And that data set is not a data set you find out there, and the base model or a pre-trained computer vision model would lack that knowledge, of course. What else?

**** · Yes.

**** · Okay.

**** · Maybe. So, just to repeat for people online, you're saying the model might have been trained on high-quality data, but the data in the wild is not that high quality.

**** · And in fact, yes, the distribution of the real world might differ from the training set, as we've seen with GANs, and that might create an issue with pre-trained models. Although pre-trained LLMs are getting better at handling all sorts of data inputs. Yes, it lacks current information. The LLM is not up to date. And in fact, you're right. Imagine you have to retrain your LLM from scratch every couple of months.

**** · One story that I found funny, it's from probably three years ago, or maybe more, five years ago, where, during his first presidency, President Trump one day tweeted "covfefe." You remember that tweet or no? Just "covfefe." It was probably a typo, or it was in his pocket, I don't know. But that word did not exist. The LLMs, in fact, that Twitter was running at the time could not recognize that word. And so the recommender system went wild, because suddenly everybody was making fun of that tweet using the word "covfefe," and the LLM was so confused about what that means, where should we show it, to whom should we show it. It is an example of how nowadays, especially on social media, there are so many new trends, and it's very hard to retrain an LLM to match the new trend and understand the new words out there. You oftentimes hear Gen Z words like "rizz" or "mid" or whatever. I don't know all of them, but you probably want to find a way to allow the LLM to understand those trends without retraining the LLM from scratch. Yeah. What else?

**** · It's trained to have a breadth of knowledge, and if you wanted to do something specialized, that might... Yeah. It might be trained on a breadth of knowledge, but it might fail or not perform adequately on a narrow task that is very well defined. Think about enterprise applications. In an enterprise application you need high precision, high fidelity, low latency, and maybe the model is not great at that specific thing. It might do fine, but just not good enough, and you might want to augment it in a certain way. Yeah.

**** · So it makes the model a lot heavier, a lot slower.

**** · So maybe it has a lot of broad domain knowledge that might not be needed for your application, and so you're using a massive, heavy model when you are only using 2% of the model's capability. You're perfectly right. You might not need all of it. So you might find ways to prune, quantize, or modify the model. All of these are good points. I'm going to add a few more as well. LLMs are very difficult to control. Your last point is an example of that. You want to control the LLM to use a part of its knowledge, but it doesn't. It's in fact getting confused. We've seen that in history. In 2016, Microsoft created a notorious Twitter bot that learned from users, and it quickly became a racist jerk. Microsoft ended up removing the bot 16 hours after launching it. The community was really fast at determining that this was a racist bot. And you can empathize with Microsoft in the sense that it is hard to control an LLM. They might have done a better job qualifying it before launching, but it is really hard to control an LLM. Even more recently, this is a tweet from Sam Altman last November, where there was this debate between Elon Musk and Sam Altman on whose LLM is the left-wing propaganda machine or the right-wing propaganda machine, and they were hating on each other's LLMs. But that tells you, at the end of the day, that even those two teams, Grok and OpenAI, which are probably the best-funded teams with a lot of talent, are not doing a great job at controlling their LLMs.

**** · And from time to time, if you hang out on X, you might see screenshots of users interacting with LLMs and the LLM saying something really controversial or racist, or something that would not be considered great [laughter] by social standards, I guess. And that tells you that the model is really hard to control.

**** · The second aspect of it is something that you've mentioned earlier. LLMs may underperform on your task, and that might include specific knowledge gaps, such as in medical diagnosis. If you're doing medical diagnosis, you would rather have an LLM that is specialized for that and is great at it. And in fact, something that we haven't mentioned as a group is sources, so that the answer is sourced specifically. You have a hard time believing something unless you have the actual source of the research that backs it up. Then there are inconsistencies in style and format. So, imagine you're building a legal AI agentic workflow. Legal has a very specific way to write and read, where every word counts. If you're negotiating a large contract, every word in that contract might mean something else when it comes to court. And so it's very important that you use an LLM that is very good at it. The precision matters. And then there is task-specific understanding, such as doing a classification in a niche field. Here I pulled an example where, let's say, a biotech product is trying to use an LLM to categorize user reviews into positive, neutral, or negative. Maybe for that company, something that would typically be considered a negative review is considered a neutral review, because the NPS of that industry tends to be way lower than in other industries, let's say. That's a task-specific understanding, and the LLM needs to be aligned to what the company believes is the categorization that it wants. We will see an example of how to solve that problem in a second. And then limited context handling. A lot of AI applications, especially in the enterprise, require data that has a lot of context. Just to give you a simple example, knowledge management is an important space, and enterprises buy a lot of knowledge management tools. When you go on your drive and you have all your documents, ideally you could have an LLM running on top of that drive.
You can ask any question and it will immediately read thousands of documents and answer. "What was our Q4 performance in sales?" "It was X dollars." It finds it super quickly. In practice, because LLMs do not have a large enough context, you cannot use a standalone vanilla pre-trained LLM to solve that problem.

**** · You will have to augment it.

**** · Does that make sense?

**** · The other aspect around context windows is that they are in fact limited. If you look at the context windows of the models from the last 5 years, even the best models today will range in context window, or number of tokens they can take as input, somewhere in the hundreds of thousands of tokens, max. Just to give you a sense, 200,000 tokens is roughly two books. Yeah. So that's pretty much how much you can upload and it can read, and you can imagine that when you're dealing with video understanding or heavier data files, that is of course an issue.

**** · So you might have to chunk it, you might have to embed it, you might have to find other ways to get the LLM to handle larger contexts.

**** · The attention mechanism is also powerful but problematic, because it does not do a great job at attending in very large contexts. There is an interesting problem called needle in a haystack. It's an AI problem, or call it a benchmark, where, in order to test if your LLM is good at putting attention on a very specific fact within a large corpus, researchers might randomly insert in a book one sentence that outlines a certain fact, such as "Arun and Max are having coffee at Blue Bottle," in the middle of the Bible, let's say, or some very long text. Then you ask the LLM what Arun and Max were having at Blue Bottle, and you see if it remembers that it was coffee. It's a complex problem, not because the question is complex, but because you're asking the model to find a fact within a very large corpus, and that's complicated.
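
The needle-in-a-haystack test described above can be sketched in a few lines. This is a minimal illustration, not a published benchmark harness: `insert_needle`, `needle_test`, and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in that simply searches the text, where a real test would call an actual model.

```python
def insert_needle(haystack_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = int(len(haystack_sentences) * depth)
    return haystack_sentences[:position] + [needle] + haystack_sentences[position:]


def needle_test(llm, haystack_sentences, needle, question, expected, depth=0.5):
    """Return True if the model's answer mentions the expected fact."""
    document = " ".join(insert_needle(haystack_sentences, needle, depth))
    prompt = f"{document}\n\nQuestion: {question}\nAnswer:"
    answer = llm(prompt)
    return expected.lower() in answer.lower()


def fake_llm(prompt):
    """Stand-in for a real model call (assumption): answers by plain string search."""
    document = prompt.split("Question:")[0]
    return "They were having coffee." if "Blue Bottle" in document else "I don't know."


# A long filler text plays the role of the book.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "Arun and Max are having coffee at Blue Bottle."
found = needle_test(
    fake_llm, filler, needle,
    "What were Arun and Max having at Blue Bottle?", "coffee",
)
```

Sweeping `depth` from 0 to 1 and growing the filler text is one way to see where, and at what context length, a given model starts to miss the fact.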

**** · So again, this is a limiting factor for LLMs. We'll talk about RAG in a second, but I want to preview that there are debates around whether RAG is the long-term approach for AI systems. As a high-level idea, RAG is a mechanism, if you will, that embeds documents that an LLM can retrieve and then add as context to its initial prompt to answer a question. It has lots of applications; knowledge management is an example. So imagine you have your drive again, but every document is compressed in representation, and the LLM has access to that lower-dimensional representation. The debate that this tweet from Yu outlines is that, in theory, if we have infinite compute, then RAG is useless, because you can just read a massive corpus immediately and answer your question. But even in that case, latency might be an issue. Imagine the time it takes for an AI to read all your drive every single time you ask a question. It doesn't make sense. So RAG has other advantages beyond even accuracy. On top of that, the sourcing matters as well: RAG allows you to source. We'll talk about all that later. But there is always this debate in the community about whether a certain method is future-proof, because in practice, as compute power doubles every year, let's say, some of the methods we're learning now might not be relevant 3 years from now. We don't know, essentially. And the analogy that he makes on context windows, and why RAG approaches might be relevant even a long time from now, is search. When you search on a search engine, you still find sources of information, and in fact, in the background, there are very detailed traversal algorithms that rank and find the specific links that might be the best to present to you. Versus imagine you had to read the entire web every single time you're doing a search query, without being able to narrow to a certain portion of the space. That might again not be reasonable.
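
To make the retrieve-then-add-as-context idea concrete, here is a toy sketch. It assumes a bag-of-words count vector in place of a learned embedding model, and the three "drive" documents are made up for illustration; `embed`, `retrieve`, and `build_rag_prompt` are hypothetical helper names, not part of any library.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question, documents, k=1):
    """Add retrieved documents as numbered context so the answer can cite its sources."""
    hits = retrieve(question, documents, k)
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(hits))
    return f"Answer using only the sources below and cite them.\n{context}\n\nQuestion: {question}"


# Made-up documents standing in for files on a shared drive.
drive = [
    "Q4 sales performance: revenue grew compared with Q3.",
    "Office holiday schedule for the engineering team.",
    "Onboarding checklist for new product managers.",
]
prompt = build_rag_prompt("What was our Q4 performance in sales?", drive)
```

Only the retrieved document reaches the prompt, which is what keeps latency and context length under control, and the numbered sources are what make the answer traceable.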

**** · Okay, when we're thinking of improving LLMs, the easiest way to think of it is along two dimensions. One dimension is that we are going to improve the foundation model itself. So for example, we move from GPT-3.5 Turbo to GPT-4 to GPT-4o to GPT-5. Each of those is supposed to improve the base model. GPT-5 is another debate, because it's packaging other models within itself. But if you're thinking about 3.5, 4, and 4o, that's really what it is. The pre-trained model improves, and so you should see your performance improve on your tasks. But the other dimension is that we can engineer how we leverage the LLM in a way that makes it better. So you can simply prompt GPT-4o.

**** · You can chain some prompts and improve the prompt, and it will improve the performance. It's been shown. You can even put a RAG around it. You can put an agentic workflow around it. You can even put a multi-agent system around it. And that is another dimension for you to improve performance. So that's how I want you to think about it: which LLM am I using, and then how can I maximize the performance of that LLM?

**** · This lecture is about the vertical axis. Those are the methods that we will see together.

**** · Sounds good for the introduction. So let's move to prompt engineering. I'm going to start with an interesting study, just to motivate why prompt engineering matters. There is a study from Harvard Business School, UPenn (Wharton), and others that took a subset of BCG consultants, individual contributors, and split them into three groups. One group had no access to AI.

**** · One group had access to, I think it was, GPT-4. And then one group had access to the LLM but also training on how to prompt better. Then they observed the performance of these consultants across a wide variety of tasks. There are a few things they noticed that I thought were interesting. One is something they call the jagged frontier, meaning that certain tasks that consultants are doing fall beyond the jagged frontier: AI is not good enough. It's not improving human performance. In fact, it's making it worse. And some tasks are within the frontier, meaning that AI is significantly improving the performance, the speed, the quality of the consultant. Many tasks fall within and many tasks fall outside, and they shared their insights, but the TL;DR is that there is a frontier within which AI is absolutely helping, and one where they call out this behavior of "falling asleep at the wheel," where people relied on AI on a task that was beyond the frontier, and in fact it ended up going worse because the human was not reviewing the outputs carefully enough. They did note that the group that was trained on prompt engineering did better than the group that was not, which also motivates why this lecture matters, so that you're within that group afterwards. Another insight was the centaurs and the cyborgs. They noticed that consultants had the tendency to work with AI in one of two ways, and you might find yourself part of one of these groups. The centaurs are mythical creatures that are half human, half, I think, half what? Horses. Yeah, horses. Half horse, half something. And those were individuals who would divide and delegate. They might give a pretty big task to the AI. So imagine you're working on a PowerPoint, which consultants are known to do. You might write a very long prompt on how you want it to do your PowerPoint, then let it work for some time, and then come back and it's done. Others would act as cyborgs.
Cyborgs are fully blended bionic human robots, a human augmented with robotic parts. Those individuals would not fully delegate a task. They would work super quickly with the model, back and forth. I find that a lot of students work more like cyborgs than centaurs. Meanwhile, in the enterprise, when you're trying to automate a workflow, you're maybe thinking more like a centaur.

**** · Yeah, that's just something good to keep in mind. Also, a lot of companies would tell you, "Oh, we're hiring prompt engineers," etc. "It's a career." I don't buy that. I think it's just a skill that everybody should have. You're not going to make a career out of prompt engineering, but you're probably going to use it as a very powerful skill in your career.

**** · So let's talk about basic prompt design principles. I'm giving you a very simple prompt here: "Summarize this document," and then the document is uploaded alongside it. The model does not have much context around what the summary should be, how long the summary should be, what it should talk about, etc. You can improve this prompt by doing something like: "Summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers." That's already better. You're sharing the audience, and it's going to tailor it to the audience.

**** · You're saying that you want five bullet points and you want to focus only on key findings. That's a better prompt, you would argue. How could you make this prompt even better? What are other techniques that you've heard of or tried yourself that could make this one-shot prompt better?

**** · Yeah.

**** · Okay.

**** · An example. So you mean, say, "Here is an example of a great summary." Yeah, you're right. That's a good idea.

**** · Tell it to be someone: "act like you are..." Very popular technique. "Act like a renewable energy expert giving a conference at Davos," let's say. Yeah, that's great. Someone said it sounds like "You're really good at it, you are the best in the world at this, explain." [laughter] Yeah, these things work. It's funny, but it does work to say "act like XYZ." It's a very popular prompt template. But we'll see a few examples. What else could you do?

**** · Yes.

**** · I personally to say critique your own project.

**** · Okay.

**** · Critique your own project. So you're using reflection. So you might do one output and then ask it to critique it and then give it back. Yeah, we'll see that. That's a great one.

**** · That's the one that probably works best among those, typically, but we'll see some examples. What else?

**** · Yeah.

**** · Breaks.

**** · Okay. Break the task down into steps. Do you know what that is called?

**** · No.

**** · Okay.

**** · Chain of thought. So, this is a popular method that research has shown improves performance. You could give a clear instruction and also encourage the model to think step by step: "Approach the task step by step and do not skip any step."

**** · And then you give it some steps, such as: step one, identify the three most important findings. Step two, explain how each key finding impacts renewable energy policy. Step three, write the five-bullet summary, with each point addressing a finding, etc. So, chain of thought. I linked the paper from 2023 that popularized chain of thought. Chain of thought is very popular now, especially in AI startups that are trying to control their LLMs.
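
The step-by-step prompt described above can be assembled programmatically. This is a small sketch; `chain_of_thought_prompt` is a hypothetical helper, and the task wording is adapted from the lecture's renewable-energy example.

```python
def chain_of_thought_prompt(task, steps):
    """State the task, ask for step-by-step work, then list the numbered steps."""
    lines = [task, "Approach the task step by step and do not skip any step."]
    lines += [f"Step {i}: {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)


prompt = chain_of_thought_prompt(
    "Summarize this 10-page scientific paper on renewable energy for policymakers.",
    [
        "Identify the three most important findings.",
        "Explain how each finding impacts renewable energy policy.",
        "Write the five-bullet summary, with each point addressing a finding.",
    ],
)
```

The whole thing is still a single prompt sent in a single call; the steps only structure how the model is asked to reason.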

**** · Okay, to go back to your examples about "act like XYZ": what I like to do, and Andrew also talks about that, is to look at other people's prompts. In fact, online you have a lot of prompt repositories for free on GitHub. I linked the awesome prompt template repo on GitHub, where you have so many examples of great prompts that engineers have built. They said, "It works great for us," and they published it online. And a lot of them start with "act as": act as a Linux terminal, act as an English translator, act as a position interviewer, etc.

**** · The advantage of a prompt template is that you can put it in your code and scale it for many user requests. So let me give you an example from Workera, where Workera evaluates skills (some of you have taken the assessments already) and tries to personalize it to the user. In an enterprise HR system, you might have that Jane is a product manager, level three, she is in the US, and her preferred language is English. That metadata can be inserted in a prompt template that we personalize for Jane. And similarly for Joe, whose preferred language is Spanish: it will tailor it to Joe. That's called a prompt template.
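
A prompt template of this kind can be sketched with Python's standard `string.Template`. The template wording and the field names are assumptions made for illustration, and Joe's role, level, and country are not given in the lecture, so they are simply copied from Jane's profile here.

```python
from string import Template

# Hypothetical template; a real product would tune this wording.
MENTOR_TEMPLATE = Template(
    "Act like a career mentor. The user is $name, a $role (level $level) based in $country.\n"
    "Respond in $language and tailor your feedback to their role and level."
)


def render_prompt(profile):
    """Fill the template from a user-profile dict; raises KeyError if a field is missing."""
    return MENTOR_TEMPLATE.substitute(profile)


# Metadata as it might come out of an HR system.
jane = {"name": "Jane", "role": "product manager", "level": 3,
        "country": "the US", "language": "English"}
# Assumption: only Joe's name and preferred language differ.
joe = {**jane, "name": "Joe", "language": "Spanish"}

jane_prompt = render_prompt(jane)
joe_prompt = render_prompt(joe)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a missing field fails loudly instead of sending a prompt with an unfilled placeholder to the model.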

**** · Yeah.

**** · Foundation models they don't use something you have to.

**** · So the question is: do the foundation models use a prompt template, or do you have to integrate it yourself? The foundation models probably use a system prompt that you don't see. When you type on ChatGPT, it is possible (it's not public) that OpenAI, behind the scenes, has "act like a very helpful assistant for this user, and by the way, here are your memories about the user that we kept in a database" (you can check your memories), and then your prompt goes under it, and then the generation starts. So probably they're using something like that, but it doesn't mean you can't add one yourself. In fact, if you think about a prompt template for the Workera example I was showing, maybe it starts, when you call OpenAI, with "act like a helpful assistant," and then underneath it's "act like a great AI mentor that helps people in their career," and OpenAI's prompt template also has "follow the instructions from the creator" or something like that. It's possible. Yeah.

**** · Any questions about prompt templates? Again, I would encourage you to go and read examples of prompts. Some of them are quite thoughtful. Let's talk about zero-shot versus few-shot prompting. It came up earlier. Here's an example, again going back to the categorization of product reviews.

**** · Let's say that we're working on a task where the prompt is: "Classify the tone of this sentence as positive, negative, or neutral." And then you paste the review, which is: "The product is fine, but I was expecting more."

**** · If I were to survey the room, I would bet that some of you would say it's negative and some of you would say it's neutral, because you have a first part that is relatively positive, "it's fine," and then the second part, "I was expecting more," which is relatively negative. So where do you land? This can be a subjective question, and maybe in one industry this would be considered amazing and in another one it would be considered really bad, because people are used to really flourishing reviews. And so the way you can align the model to your task is by converting that zero-shot prompt (zero-shot refers to the fact that it's not being given any example) into a few-shot prompt, where the model is given, in the prompt, a set of examples to align it to what you want it to do. So the example here is, again, you paste the same prompt as before with the user review, and then you add: here are examples of tone classifications. "This exceeded my expectations completely."

**** · Positive. "It's okay, but I wish it had more features." Negative. "The service was adequate. Neither good nor bad." Neutral. Now classify the tone of this sentence, after you've heard about these things. And the model then says negative. The reason it says negative, of course, is likely the second example, "It's okay, but I wish it had more features," which we told the model was negative. Because the model saw that, it's now aligned with your expectations.

**** · Few-shot prompts are very popular. In fact, in AI startups that are slightly more sophisticated, you might see them keep a prompt up to date: whenever a user says something, they might have a human label it and then add it as a few-shot example in the relevant prompt in their codebase. You can think of that as almost building a data set, but instead of building a separate data set, like we've seen with supervised fine-tuning, and then fine-tuning the model on it, you're just putting it directly in the prompt. And it turns out it's probably faster to do that if you want to experiment quickly, because you don't touch the model parameters. You just update your prompts. And if it's text examples, you can concatenate many examples in a single prompt. At some point it will be too long and you will not have the necessary context window. But it's a pretty strong approach that is quick to align an LLM.
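
The few-shot prompt from this example, and the habit of appending newly labeled examples to it, can be sketched as follows. `few_shot_prompt` and `add_example` are hypothetical helpers, the exact prompt layout is an assumption, and the extra labeled sentence at the end is invented for illustration.

```python
INSTRUCTION = "Classify the tone of this sentence as positive, negative, or neutral."

# The three labeled examples from the lecture.
EXAMPLES = [
    ("This exceeded my expectations completely.", "positive"),
    ("It's okay, but I wish it had more features.", "negative"),
    ("The service was adequate. Neither good nor bad.", "neutral"),
]


def few_shot_prompt(instruction, examples, query):
    """Build a prompt holding the instruction, labeled examples, and the new sentence."""
    parts = [instruction, "Here are examples of tone classifications:"]
    for text, label in examples:
        parts.append(f'Sentence: "{text}"\nTone: {label}')
    parts.append(f'Now classify the tone of this sentence:\nSentence: "{query}"\nTone:')
    return "\n\n".join(parts)


def add_example(examples, text, label):
    """Return a new list with one more human-labeled example (the original is untouched)."""
    return examples + [(text, label)]


prompt = few_shot_prompt(INSTRUCTION, EXAMPLES, "The product is fine, but I was expecting more.")

# Made-up new label, as a human reviewer might add it later.
updated = add_example(EXAMPLES, "Works, but setup took far too long.", "negative")
```

Because the examples live in a plain list, updating the model's behavior is a data change in the codebase rather than a training run, which is the speed advantage the lecture points to.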

**** · Okay. Yes.

**** · Is there research on how long it can be until it starts to... So the question was: is there any research on how long the prompt can be before the model essentially loses itself or doesn't follow instructions anymore?

**** · There is. The problem is that the research is outdated every few months, because models get better, and so I don't know where the state of the art is. You can probably find it online on benchmarks. To give an example of what we see: on the Workera product you have a voice conversation, for some of you that have tried it, where you're asked, "Explain what a prompt is," and then you explain, and then there's a scoring algorithm behind it. We know that after eight turns the model loses itself. After eight turns, because you always paste the previous user responses, it just starts going wild. And so the technique we use in the background is that we create chapters of the conversation.

**** · Maybe one chapter is the first eight turns, and then you start over from another prompt. You can summarize the first part of the conversation, insert the summary, and then keep going. Those are engineering hacks that engineers might have figured out in the background. Yeah. Because, yeah, eight turns makes a prompt quite long. Let's move on to chaining.
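
One way to picture the chaptering hack is the sketch below. It is an assumption about how such a scheme could work, not the actual product code: `build_context` is a hypothetical helper, and `fake_summarize` stands in for an LLM summarization call.

```python
def build_context(turns, summarize, chapter_size=8):
    """Summarize completed chapters of the conversation; keep the current chapter verbatim."""
    completed = (len(turns) // chapter_size) * chapter_size
    if completed == len(turns) and completed > 0:
        completed -= chapter_size  # keep the most recent full chapter verbatim
    parts = []
    if completed:
        summary = summarize("\n".join(turns[:completed]))
        parts.append(f"Summary of earlier conversation: {summary}")
    parts.extend(turns[completed:])
    return "\n".join(parts)


def fake_summarize(text):
    """Stand-in for an LLM summarization call (assumption)."""
    return f"{len(text.splitlines())} earlier turns"


turns = [f"turn {i}" for i in range(10)]
context = build_context(turns, fake_summarize)
```

With ten turns and a chapter size of eight, the first eight turns collapse into one summary line and only the last two are pasted verbatim, so the prompt stops growing with every exchange.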

**** · Chaining is the most popular technique out of everything we've seen so far in prompt engineering. It's not chain of thought. Chain of thought, as we've seen, is "think step by step: step one, step two, step three, do not skip any step." This is different.

**** · This is chaining complex prompts to improve performance. And this is what it looks like: you take a single-step prompt such as "Read this customer review and write a professional response that acknowledges their concern, explains the issue, and offers a resolution," and then you paste the customer review, which is: "I ordered a laptop, it arrived 3 days late, the packaging was damaged, very disappointing. I needed it urgently for work." And then the output is an email that is immediately given to you by the LLM after it reads the prompt.

**** · So this might work, but it might be hard to control, because think about it: there are multiple steps that you have listed, and everything is embedded in the same prompt. If you wanted to debug step by step and know which step is weaker, you couldn't. You would have everything mixed together. So one advantage of chaining is that you separate the prompts so that you can debug them separately, and it also gives you an easier way to improve your workflow. Let's say a first prompt is to extract the key issues: "Identify the key concerns mentioned in this customer review," and you paste the customer review. Second prompt: "Using these issues" (so you paste back the issues), "draft an outline for a professional response that acknowledges concerns, explains possible reasons, and offers a resolution."

**** · And then prompt number three: write the full response. So, "Using the outline, write the professional response," and then you get your final output.

**** · So, in theory, you can't tell me at first that the second approach is better than the first one. But what you can notice is that we can test those three prompts separately from each other and determine whether we will get the most gains out of engineering the first prompt, the second one, or the third one. We now have three prompts that are independent from each other. And maybe if the outline were better, the performance of the email, what the open rate will be, or the user satisfaction with the response, would get higher. So chaining improves performance, but most importantly it helps you control your workflow and debug it more seamlessly.

**** · Yes.

**** · So if we know that the three prompts independently work very well, if we combine them into one prompt and we highlight that step-by-step thinking process, do we on average get the same quality output, or do we still have to do that breakdown?

**** · So let me try to rephrase. Let's say we look at the first prompt, which has all three tasks built into that prompt. What exactly do you mean? You mean, if we evaluate the output and we measure some user insight, satisfaction, etc., why don't we just modify that prompt and essentially see how it improves user satisfaction?

**** · Yeah, instead process.

**** · I see. See, why do we need the three steps?

**** · Yeah.

**** · Think about it. The intermediate output is what you want to see. If I'm debugging the first approach, the way I would do it is I would capture user insights: here's the email, how good was the response?

**** · Thumbs up, thumbs down. Was your issue resolved? Thumbs up, thumbs down. Those would tell me how good my prompt is. I can engineer that prompt, optimize it, and I would probably drive some gains, but I will not easily be able to trace back to what the problem was. In the second approach, not only can I use the end-to-end metrics to improve my process, I can also use the intermediate steps. For example, if I look at prompt two and I look at the outline and I see the outline is meh, it's not great, then I think I can get a lot of gains out of the outline. Or the outline is really good, but the last prompt doesn't do a good job at translating it into an email. So the outline is exactly what I want the LLM to do, but the translation into a customer-facing email is not good. In fact, it doesn't follow our internal vocabulary.

**** · Then I know the third prompt is where I would get the most gains. So that's what it allows me to do. Have intermediate steps to review.

**** · Yeah.

**** · Are there any lat?

**** · We'll talk about it. Are there any latency concerns? Yes. in certain applications you don't want to use a chain or you don't want to use a long chain because it adds latency. We'll talk about that later. Good point.

**** · So practically, this is what chaining complex prompts looks like. You have your first prompt with your first task. Its output is pasted into the second prompt, with the second task being defined. That output is then pasted into the third prompt, with the third task being defined, and so on. That's what it looks like in practice.
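
The pasting of each output into the next prompt can be sketched as a small loop. `run_chain` is a hypothetical helper, the prompt wording follows the customer-review example from the lecture, and `fake_llm` is a stand-in that records the prompts it receives instead of calling a real model.

```python
def run_chain(llm, prompt_templates, first_input):
    """Run prompts in sequence, pasting each output into the next; return every output."""
    outputs = []
    current = first_input
    for template in prompt_templates:
        current = llm(template.format(input=current))
        outputs.append(current)  # keep intermediate outputs for debugging
    return outputs


CHAIN = [
    "Identify the key concerns mentioned in this customer review:\n{input}",
    "Using these issues, draft an outline for a professional response that acknowledges "
    "concerns, explains possible reasons, and offers a resolution:\n{input}",
    "Using this outline, write the full professional response:\n{input}",
]

calls = []


def fake_llm(prompt):
    """Stand-in for a real model call (assumption): logs the prompt, returns a label."""
    calls.append(prompt)
    return f"output of step {len(calls)}"


review = "I ordered a laptop, it arrived 3 days late, and the packaging was damaged."
issues, outline, email = run_chain(fake_llm, CHAIN, review)
```

Returning all three outputs, not just the final email, is the point of the design: the extracted issues and the outline can each be inspected and graded on their own.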

**** · Super. We'll talk more later about testing your prompts, but there are methods now to do it, and we'll see later in this lecture, with our case study, how we can test our prompts. But here is an example of how you might do it. You might have a summarization workflow prompt that is the baseline. It's a single prompt.

**** · You might have a refined summarization which is a modified prompt of this or a workflow with a chain, and then you have your test case, which is the input that you want to summarize, let's say, and then you have the generated output, and you can have humans go and rate these outputs. And you would notice that the baseline is better or worse than the refined prompt.

**** · Of course, this manual approach takes time, but it's a good way to start, and usually the advice is to get hands-on at the beginning, because you would quickly notice some issues and it will give you better intuition on what tweaks can lead to better performance. However, if you wanted to scale that system across many products, many parts of your codebase, you might want to find a way to do that automatically without asking humans to review and grade summaries. One approach is to use platforms. At Workera, our team uses a platform called promptfoo that allows you to automate part of this testing.

**** · In a nutshell, what it does is it can allow you to run the same prompt with five different LLMs immediately and put everything in a table that makes it super easy for a human to grade, let's say. Or alternatively, it might allow you to define LLM judges. LLM judges can come in different flavors. For example, I can have an LLM judge that does a pairwise comparison.

**** · So, what the LLM is asked to do is: here are two summaries, just tell me which one is better than the other one. That's what the LLM does. And that can be used as a proxy for how good the summarization baseline is versus the refined version. Another way to do LLM-as-a-judge is single-answer grading. So here's a summary, grade it from one to five. And then you can go even deeper and do a reference-guided pairwise comparison, or you also add a rubric. You say a five is when a summary is below 100 characters.

**** · I'm just making this up: below 100 characters, mentions at least three key points that are distinct, and starts with a first sentence that gives the overview and then goes into the detail. That's a great summary, five out of five. Zero is the LLM failed to summarize and was very verbose, let's say. So you put a rubric behind it, and you have an LLM as a judge following the rubric.

**** · Of course, you can now pair different techniques. You can do few-shot for the rubric. You can give examples of five out of fives, four out of fives, three out of fives, because you now have multiple techniques. Okay, [clears throat] does that make sense?
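The two judge flavors described above, pairwise comparison and rubric-guided single-answer grading, can be sketched as prompt builders plus small parsers for the judge's reply. This is a hedged sketch: the prompt wording, the function names, and `stub_judge` are all assumptions for illustration, and it does not use promptfoo or any real judge model. The rubric mirrors the made-up one from the lecture, with the low end placed at 1 to match a one-to-five scale.

```python
import re

# Rubric text adapted from the lecture's made-up example (low anchor at 1 is an assumption).
RUBRIC = (
    "5: under 100 characters, mentions at least three distinct key points, "
    "and opens with an overview sentence.\n"
    "1: fails to summarize or is very verbose."
)

def pairwise_prompt(source: str, summary_a: str, summary_b: str) -> str:
    """Ask the judge which of two summaries is better."""
    return (
        f"Here are two summaries of the same text.\n\nText:\n{source}\n\n"
        f"Summary A:\n{summary_a}\n\nSummary B:\n{summary_b}\n\n"
        "Answer with exactly one letter, A or B, for the better summary."
    )

def rubric_prompt(source: str, summary: str, rubric: str = RUBRIC) -> str:
    """Ask the judge to grade one summary against a rubric."""
    return (
        f"Text:\n{source}\n\nSummary:\n{summary}\n\nRubric:\n{rubric}\n\n"
        "Answer with a single integer from 1 to 5."
    )

def parse_choice(reply: str) -> str:
    """Extract the first standalone A or B from the judge's reply."""
    match = re.search(r"\b([AB])\b", reply)
    if match is None:
        raise ValueError(f"no A/B choice found in: {reply!r}")
    return match.group(1)

def parse_score(reply: str) -> int:
    """Extract the first standalone digit between 1 and 5."""
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no 1-5 score found in: {reply!r}")
    return int(match.group(1))

def stub_judge(prompt: str) -> str:
    """Hypothetical judge: always prefers B in pairwise mode, otherwise gives a 4."""
    return "B" if "Summary B" in prompt else "Score: 4"

text = "Long article about prompt testing."
winner = parse_choice(stub_judge(pairwise_prompt(text, "baseline summary", "refined summary")))
score = parse_score(stub_judge(rubric_prompt(text, "refined summary")))
print(winner, score)
```

Few-shot examples of graded summaries could be appended to `RUBRIC` in the same way, which is the pairing of techniques mentioned above.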

**** · Yeah. Okay. So that was the second section on prompt engineering or the first line of optimization. Now, let's say you've exhausted all your chances for prompt engineering and you're thinking about touching the model, modifying its weights or fine-tuning it. In other words, I was telling you I'm not a fan of fine-tuning. There's a few reasons why.

**** · One, it typically requires substantial labeled data to fine-tune. Although now there are approaches that are getting better at fine-tuning and that look more like few-shot prompting than fine-tuning. The two are merging, although one modifies the weights and the other doesn't. Fine-tuned models may also overfit to specific data, losing their general-purpose utility. We're going to see a funny example. So you might fine-tune a model, and when someone asks a pretty generic question, it doesn't do well anymore.

**** · It might do well on your task. So it might be relevant or not. And then it's time and cost intensive.

**** · That's my main problem. And at work here, we steer away from fine-tuning as much as possible, because by the time you're done fine-tuning your model, the next model is out and it's beating your fine-tuned version of the previous model. So I would steer away from fine-tuning as much as you can. The advantage of the prompt engineering methods we've seen is you can put the next best pre-trained model directly in your code. It will update everything immediately. Fine-tuning doesn't work like that.

**** · [laughter] There are cases, though, where it still makes sense: if the task requires repeated high-precision output, such as legal or scientific explanation, and if the general-purpose LLM struggles with domain-specific language. So, let's look at a quick example together, which is an example from Ross Lazerowitz, I think it was a couple of years ago, September '23, where Ross tried to do Slack fine-tuning.

**** · So, he looked at a lot of Slack messages within his company and he was like, I'm going to fine-tune a model that speaks like us or operates like us, because this is how we work, this is the data that represents how people work at the company. And so he went ahead and fine-tuned the model. He gave it a prompt, he was delegating to the model: "Hey, write a 500-word blog post on prompt engineering." And the model responded, "I shall work on that in the morning." And then he tries to push the model a little further and says, "It's morning now." And the model said, "I'm writing now. It's a.m. here."

**** · "Write it now." [laughter] "Okay, I shall write it now. I don't know what you would like me to say about prompt engineering. I can only describe the process. The only thing that comes to mind for a headline is, how do we build prompt?" It's a funny example for fine-tuning because it's true that it went wrong.

**** · He was thinking, I want the model to speak like us at work, and it ended up acting like people and not following instructions. So that's one example of why I would steer away from fine-tuning. Super, let's talk about RAG. RAG is important. It's important to know what's out there and at least have the basics.

**** · It's a very common interview question, by the way. If you go interview for a job, they might ask you to explain in a nutshell, to a 5-year-old, what RAG is, and hopefully after this you'll be able to do it. So, we've seen some of the challenges with standalone LLMs.

**** · Those challenges include the context window being small, the fact that it's hard to remember details within a large context window, knowledge gaps, and the cutoff dates you mentioned earlier. The model might be trained up to a date, and then it cannot follow the trends or be up to date. Hallucinations: there are some fields, think about medical diagnosis, where hallucinations are very costly. You can't afford a hallucination. Even in education, imagine deploying a model for US youth education and it hallucinates and it teaches millions of people something completely wrong. It's a problem. And then lack of sources.

**** · A lot of fields love sources.

**** · Research fields love sources. Education loves sources. Legal loves sources as well. And the pre-trained LLM doesn't do a good job of sourcing. In fact, if you have tried to find sources on a plain LLM, it hallucinates a lot. It makes up research papers. It just lists completely fake stuff. So how do we solve that?

**** · With RAG. RAG integrates with external knowledge sources: databases, documents, APIs. It ensures that answers are more accurate, up to date and grounded, because you can update your documents.

**** · Your drive is always up to date. Ideally you're always pushing new documents to it. And when you query "what is our Q4 performance in sales," hopefully the last board deck is in the drive and it can read the last board deck. Yeah. [snorts] And more developer control. We'll see why RAG allows for targeted customization without requiring retraining of the model. In fact, you don't touch the model with RAG. It's really a technique that is put on top of the model. So to see an example of RAG, this is a question answering application where we're in the medical field and a user is asking a query.

**** · What are the side effects of drug X?

**** · This is an important question. You can't hallucinate. You need to source. You need to be up to date. Maybe there is a new update to that drug that is now in the database and you need to read that. So RAG is a great example of what you would want to use here. The way it works is you have your knowledge base of a bunch of documents. What you do is you use an embedding to embed those documents into lower-dimensional representations.

**** · So for example, if the document is a PDF, a long PDF, you might read the PDF, understand it, and then embed it. We've seen plenty of embedding approaches together, triplet loss, etc., you remember. So imagine one of them here, for LLMs, embedding those documents into a lower-dimensional representation.

**** · If the representation is too small, you will lose information. If it's too big, you will add latency. It's a trade-off. You will typically store those representations in a database called a vector database. There are a lot of vector database providers out there.

**** · I think I've listed a couple that are very common. No, I haven't listed them, but I can share afterwards. The vector database is essentially storing those vectors in a very efficient manner, allowing fast retrieval with a certain distance metric. So what you do is you also embed the user prompt, usually with the same algorithm, and you run a retrieval process, which is essentially saying: based on the embedding of the user query and the vector database, find the relevant documents based on the distance between those embeddings.

**** · Once you found the relevant documents, you pull them and then you add them to the user query with a system prompt or a prompt template on top. So the prompt template can be answer user query based on list of documents.

**** · If the answer is not in the documents, say I don't know. That's your prompt template, where the user query is pasted and the documents are pasted, and then your output should be what you want because it's now grounded in the documents. You can also add to this prompt template: tell me the exact page, chapter, line of the document that was relevant, and in fact link it as well, just to be more precise.

**** · Any question on RAG? That's a simple vanilla RAG.
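The vanilla pipeline just described (embed the documents, store the vectors, embed the query the same way, retrieve by distance, paste into a prompt template) can be sketched end to end. This is a toy under stated assumptions: `embed` uses word counts instead of a learned low-dimensional embedding, `ToyVectorStore` is an in-memory list rather than a real vector database, and the documents and template wording are made up for illustration. The final model call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """In-memory list of (doc_id, text, vector); real vector databases index for speed."""
    def __init__(self):
        self.items = []

    def add(self, doc_id, text):
        self.items.append((doc_id, text, embed(text)))

    def search(self, query, k=2):
        query_vec = embed(query)  # same embedding function as the documents
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[2]), reverse=True)
        return [(doc_id, text) for doc_id, text, _ in ranked[:k]]

# Template wording is illustrative, following the lecture's description.
PROMPT_TEMPLATE = (
    "Answer the user query based only on the documents below. "
    "If the answer is not in the documents, say \"I don't know\". "
    "Cite the id of the document you used.\n\n"
    "Documents:\n{documents}\n\nUser query: {query}"
)

def build_prompt(query, retrieved):
    """Paste the retrieved documents and the user query into the template."""
    documents = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieved)
    return PROMPT_TEMPLATE.format(documents=documents, query=query)

store = ToyVectorStore()
store.add("drug_x_label", "Drug X side effects include nausea and headache.")
store.add("drug_y_label", "Drug Y is taken twice daily with food.")
store.add("sales_deck", "Q4 sales grew in all regions.")

query = "What are the side effects of drug X?"
prompt = build_prompt(query, store.search(query, k=1))
print(prompt)  # this prompt would then be sent to the LLM
```

Updating the knowledge base is just another `store.add` call, which is why the model itself never needs retraining.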

**** · Yeah. Yes. Do the document embeddings still retain information about what's on what page and in what paragraph?

**** · Question is, do the document embeddings still retain the information of the location of the information within that document, especially in big documents?

**** · Great question. We'll get to it in a second, because you're right that the vanilla RAG might not do a good job with very large documents. Let's say you open a medication box and you have this gigantic white paper with all the information, and it's very long. Maybe a vanilla RAG would not cut it. So what people have figured out is a bunch of techniques to improve RAG, and in fact chunking is a great technique that is very popular. You might store in the vector database the embedding of the full document, and on top of that you also store a chapter-level vector. When you retrieve, you retrieve the document and you retrieve the chapter, and that allows you to be more precise with the sourcing. It's one example.
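The chunking idea above, a document-level vector plus one vector per chapter, can be sketched as a two-stage lookup. This is a minimal sketch with assumptions: the word-count `embed` is a toy stand-in for a real embedding model, the leaflet texts are invented, and splitting by chapter title is just one possible chunking scheme.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def index_document(doc_id, chapters):
    """Build one document-level vector plus one vector per chapter."""
    full_text = " ".join(chapters.values())
    return {
        "doc_id": doc_id,
        "doc_vec": embed(full_text),
        "chapters": [(title, text, embed(text)) for title, text in chapters.items()],
    }

def retrieve(query, index):
    """Pick the closest document, then the closest chapter inside it."""
    query_vec = embed(query)
    doc = max(index, key=lambda d: cosine(query_vec, d["doc_vec"]))
    title, text, _ = max(doc["chapters"], key=lambda c: cosine(query_vec, c[2]))
    return doc["doc_id"], title, text

# Invented leaflet content, for illustration only.
index = [
    index_document("drug_x_leaflet", {
        "Dosage": "Take one tablet of drug X twice daily with water.",
        "Side effects": "Common side effects of drug X are nausea and dizziness.",
        "Storage": "Store drug X below 25 degrees away from light.",
    }),
    index_document("drug_y_leaflet", {
        "Dosage": "Take drug Y once daily.",
        "Side effects": "Drug Y may cause drowsiness.",
    }),
]

doc_id, chapter, passage = retrieve("What are the side effects of drug X?", index)
print(doc_id, "/", chapter, ":", passage)
```

Returning the chapter title along with the document id is what lets the final answer cite a location inside a long document rather than the document as a whole.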

**** · Another technique that's popular is HyDE, hypothetical document embeddings, where a group of researchers published a paper showing that when you get your user query, one of the main problems is that the user query does not look like your documents. For example, the user query might be "what are the side effects of drug X?" when the vectors in the vector database represent very long documents. So how do you guarantee that the query embedding is going to be close to the document embedding? What they do is they use the user query to generate a fake, hallucinated document.

**** · They embed that document, and then they compare it to the vectors in the vector database. Does that make sense? So for example, the user says, "What is the side effect of drug X?" This is given to another prompt that says, "Based on this user query, generate a five-page report answering the user query." It generates potentially a completely fake answer. You embed that, and it will likely be closer to the document that you're looking for. Yeah, it's one example of a RAG approach.
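A toy version of HyDE makes the mechanism concrete: embed a generated hypothetical answer instead of the raw query, then search with that. Everything here is an assumption for illustration: `fake_generate` stands in for the LLM that writes the hypothetical report, the word-count `embed` stands in for a real embedding model, and the two corpus entries are invented so that the query shares no vocabulary with the relevant document.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def nearest(text, corpus):
    """Return the id of the corpus document closest to the given text."""
    vec = embed(text)
    return max(corpus, key=lambda doc_id: cosine(vec, embed(corpus[doc_id])))

def fake_generate(prompt):
    """Hypothetical LLM call: returns a plausible, possibly wrong, answer document."""
    return "Drug X adverse reactions may include nausea, vomiting and abdominal pain."

def hyde_search(query, corpus, generate=fake_generate):
    """Embed a generated hypothetical document instead of the raw query."""
    hypothetical = generate(
        f"Based on this user query, generate a report answering it: {query}"
    )
    return nearest(hypothetical, corpus)

corpus = {
    "adverse_reactions": "Reported adverse reactions: nausea, vomiting, abdominal pain.",
    "opening_hours": "The pharmacy opens at nine.",
}
query = "Does drug X upset the stomach?"

direct_hit = nearest(query, corpus)     # raw query only shares the word "the"
hyde_hit = hyde_search(query, corpus)   # hypothetical answer shares the medical terms
print(direct_hit, hyde_hit)
```

The hypothetical document does not need to be correct. It only needs to look like the documents in the store, so that its embedding lands near the right one.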

**** · Again, the purpose of this lecture is not to go through this whole tree and explain every single method that has been discovered for RAG, but I just wanted to show you how much research has been done between 2020 and 2025 on RAG and how many branches of research you now have that you can learn from. The survey paper is linked in the slides, by the way, and I'll share them after the lecture.

**** · [laughter] Super.

**** · So, we've made some progress.

**** · Hopefully now you feel that if you were to start an LLM application, you know how to do better prompts, how to do chains, how to do fine-tuning, you also know how to do retrieval, and you have the baggage of techniques that you can go and read about, find the codebase, pull the code, vibe code it, but you have the breadth. Now, the next set of topics we're going to see is around the question of how we could extend the capabilities of LLMs from performing single tasks, enhanced with external knowledge, to handling multi-step autonomous workflows. Yeah.

**** · And this is where we get into proper agentic AI. So let's talk about agentic AI workflows, towards autonomous and specialized systems. Then we'll talk about evals. Then we'll see multi-agent systems. And we'll end with a few thoughts on what's next in AI. So Andrew Ng coined the term agentic AI workflows, and his reason was that a lot of companies say agents, agents, agents everywhere.

**** · If you go and work at these companies, you would notice that they mean very different things by agent. Some people have a prompt and they call it an agent. Other people have a very complex multi-agent system.

**** · They call it an agent. And so calling everything an agent doesn't do it justice. So Andrew says, "Let's call it agentic workflows," because in practice it's a bunch of prompts with tools, with additional resources, API calls, that ultimately are put in a workflow, and you can call that workflow agentic. So it's all about the multi-step process to complete the task.

**** · Also, calling it agentic workflow allows us to not mix it up with what I called an agent in the last lecture, with reinforcement learning, because in RL, agent has a very specific definition: it interacts with an environment, passes from one state to the other, has a reward and an observation. You remember that chart.

**** · So here's an example of how we move from a one-step prompt to a multi-step agentic workflow. Let's say a user asks a chatbot about a product: what is your refund policy? And the response using RAG says refunds are available within 30 days of purchase. And maybe the RAG can even link to the policy document.

**** · That's what we learned so far. Instead, an agentic workflow can function like this. The user says, "Can I get a refund for my order?" And the response via the agentic workflow is: the agent retrieves the refund policy using RAG. The agent then follows up with the user and says, can you provide your order number? Then the agent queries an API to check the order details, and finally it comes back to the user and confirms: your order qualifies for a refund. The amount will be processed in 3 to 5 business days. This is much more thoughtful than the first version, which is vanilla.

**** · So that's what we're going to talk about in the next couple of slides: how do we get from the first one to the second one. There are plenty of specialized agentic workflows online. If you hang out in SF, you probably see a bunch of billboards: AI software engineer, AI skills mentor, which you've interacted with in the class through Workera, AI SDR, AI lawyers, AI specialized cloud engineer.

**** · It would be a stretch to say that everything works, but there's work being done towards that. Yeah, I'm not personally a fan of putting a face behind those things. I think it's gimmicky, and I think in a few years from now very few products will have a human face behind them. But it might be a marketing tactic from some startups.

**** · It's more scary than it is engaging, frankly. Okay, I want to talk about the paradigm shift. That's especially useful, let's say, if you're a software engineer or you're planning to be a software engineer, because software engineering as a discipline is shifting, or at least the best engineers I've worked with are able to move from a deterministic mindset to a fuzzy mindset and balance between the two whenever they need to get something done. So here's the paradigm shift between traditional software and agentic AI software. The first one is the way you handle data. Traditional software deals with structured data. You have JSONs, you have databases. They're passed in a very structured manner through a data engineering pipeline, and then they're displayed on a certain interface. The user might fill in a form that is then retrieved and stored in the database. All of that historically has been structured data.

**** · Now more and more companies are handling free-form text and images, and all of that requires dynamic interpretation to transform an input into an output. The software itself used to be deterministic. Now you have a lot of software that is fuzzy, and fuzzy software creates so many issues. Imagine if you let your user ask anything on your website.

**** · The chances that it breaks are tremendous. The chances that you're attacked are tremendous. It's really, really complicated. It's more complicated than people make it seem on Twitter. [snorts] Fuzzy engineering is truly hard. Yeah, you might get hate as a company because one user did something that you authorized them to do that ended up breaking the database. We've seen that with many companies in the last couple of years. So, it takes a very specialized engineering mindset to do fuzzy engineering, but also to know when you need to be deterministic.

**** · The other thing I'd call out is that with agentic AI software, you want to think about your software as a manager would. You're familiar with the monolith or microservices approaches in software, where you structure your software in different boxes that can talk to each other, and it allows teams to debug one section at a time. Now, the equivalent with agentic AI is you think as a manager. So you think, okay, if I were to delegate my product to be done by a group of humans, what would those roles be? Would I have a graphic designer that puts together a chart and then sends it to a marketing manager that converts it into a nice blog post, that then gives it to the performance marketing expert that publishes the blog post and then optimizes and A/B tests, then to a data scientist that analyzes the data and then puts forward hypotheses and validates or invalidates them? That's how you would typically think if you're building agentic AI software, whereas the equivalent of that in traditional software might be completely different. It might be: we have a data engineering box here that handles all our data engineering. And then here we have the UI/UX stuff. Everything UI/UX related goes here. And companies might structure it in very different ways. And here's the business logic that we want to care about, and there are five engineers working on the business logic, let's say. Okay. [snorts and laughter] Testing and debugging is also very different, and we'll talk about it in the next section.

**** · The other thing that I feel matters is that with AI in engineering, the cost of experimentation is going down drastically, and so I feel people should be more comfortable throwing away code. In traditional software engineering you probably don't throw away code a ton. You build code and it's solid and it's bulletproof, and then you update it over time, whereas we've seen AI companies be more comfortable throwing away code. Yeah.

**** · Which has advantages in terms of the speed at which you move but also disadvantages in terms of the quality of your software that it can break more.

**** · No.

**** · Okay. So anyway, I just wanted to do an aside on the paradigm shift from deterministic to fuzzy engineering.

**** · Oh, and I can give you an example from Workera that we learned probably over the last 12 months. If you've used Workera, you might have seen that the interface sometimes asks you multiple-choice questions, sometimes multiple-select, and sometimes drag and drop, ordering, matching, whatever. Those are examples of deterministic item types, meaning you answer a multiple-choice question, there's one correct answer, and it's fully deterministic. On the other hand, you sometimes have voice questions where you go through a role play, or you have voice plus coding questions where your code is being read by the interface, or whatever. Those are fuzzy, meaning the scoring algorithm might make mistakes, and those mistakes might be costly. And so companies have to figure out a human-in-the-loop system, which you might have seen with the appeal feature at the end.

**** · So at the end of the assessment, you have an appeal feature where it allows you to say, I want to appeal the agent because I want to challenge what the agent said on my answer because I thought I was better than what the agent thought. And then you bring a human in the loop that then can fix the agent, can tell the agent, you were too harsh on the answer of this person.

**** · And that's an example of a fuzzy engineered system that then adds a human in the loop to make it more aligned. And so if you're building a company, I would encourage you to think about: what can I get done with determinism? Let's get that done. And then the fuzzy stuff, I want to do fuzzy because it allows more interaction, it allows more back and forth, but I need to put guardrails around it. And how am I going to design those guardrails, pretty much? Here's another example from enterprise workflows, which are likely to change due to agentic AI. This is a paper from McKinsey, I believe from last year, where they looked at a financial institution and they said: we observe that they often spend one to four weeks to create a credit risk memo, and here is the process. A relationship manager gathers data from more than 15 sources on the borrower, loan type, and other factors. Then the relationship manager and the credit analyst collaboratively analyze that data from these sources.

**** · Then the credit analyst typically spends 20 hours or more writing a memo and then goes back to the relationship manager. They give feedback, and then they go through this loop again and again, and it takes a long time to get a credit memo out. And then they ran a research study where they changed the process. They said GenAI agents could cut time by 20 to 60% on credit risk memos, and the process has changed to: the relationship manager works directly with the GenAI agent system and provides the relevant materials that it needs to produce the memo. The agent subdivides the project into tasks that are assigned to specialist agents, gathers and analyzes the data from multiple sources, and drafts a memo. Then the relationship manager and the credit analyst sit down together, review the memo, give feedback to the agent, and within 20 to 60% less time they are done. And so this is an example where you're not changing the human stakeholders, you're just changing the process and adding GenAI to reduce the time it takes to get a credit memo out. Imagine you're an enterprise and you have 100,000 employees, and there are a lot of enterprises with 100,000 employees out there. You are currently in a crisis in terms of redesigning your workflows. It turns out that if you pull the job descriptions from the HR system and you interpret them, and you also pull the business process workflows that you have encoded in your drive, you can find gains in multiple places, and in the next few years you're probably going to see workflows being more optimized to add GenAI. Even if that happens, the hardest part is changing people. We know this is great in theory, but now let's try to fit that second workflow for 10,000 credit risk analysts and relationship managers. My guess is it will take years. 
It will take 10 to 20 years to get this done at scale within an organization, because change is so hard. It's so hard to rewire business workflows and job descriptions, to incentivize people to do things differently and be different, and to train them. And so this is what the world is going towards, but it's going to take a long time, I think. Okay, then I want to talk about how the agent works and what the core components of an agent are. Imagine a travel booking AI agent.

**** · That's an easy example you've all thought about. I still haven't been able to get an agent to book a trip for me or I was scared because it was going to book a very expensive or long trip. But in theory you can you can have a travel booking agent that has prompts.

**** · So the prompts, we've seen them, we know the methods to optimize those prompts. That travel agent also has a context management system, which is essentially the memory of what it knows about the user. That context management system might include a core memory, or working memory, and an archival memory. Okay. The difference within memory is that not every memory needs to be fast to access. Think about it. You're onboarded on a product and the first question is, hi, what's your name? And I say my name is Kian. That's probably going to sit in the working memory, because every time the agent talks to me it's going to want to use my name. But then maybe the second question is, Kian, what's your birthday? And I give it my birthday. Does it need my birthday every day? Probably not. So it's probably going to park it in the long-term memory, or the archival memory, and those memories are slower to access. They're farther down the stack. That structure allows the agent to determine what's the working memory and what's the long-term memory, and that makes it easier for the agent to retrieve super fast. Because think about it: when you interact with GPT, you feel that it's very personal at times, you feel it understands you. Imagine that every time you call it, it has to read the memories. That can be costly. It's a very burdensome cost because it happens every time you talk to it. So you want to be highly optimized with the working memory. If it takes 3 seconds to look in the memory, every time you talk to your LLM it's going to take 3 seconds, which you don't want. So anyway, and then you have the tools. The tools can include APIs like a flight search API, hotel booking API, car rental API, weather API, and then the payment processing API. And typically, you would want to tell your agent how that API works. It turns out that agents, or LLMs I should say, are very good at reading API documentation. 
So you give it the API documentation, and it reads the JSON and it reads what a GET request looks like, and this is the format that I need to push, and then it pushes it in that format, let's say, and then it retrieves something.

**** · Does that make sense? Those are the different components. Anthropic also talks about resources. Resources are data that is sitting somewhere that you might let your agent read. For example, if you're building your startup, you have a CRM. A CRM has data in it, and you want to use lookups in that data. You will probably give the agent a lookup tool and give it access to the resource, and it will do lookups whenever you want. Super fast.
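The working versus archival memory split described above can be sketched as two stores with different access patterns. This is a minimal sketch under assumptions: the class name, the fixed `working_keys` policy, and the placeholder birthday value are all hypothetical, and a real system might let the agent itself decide what belongs in which tier.

```python
class AgentMemory:
    """Two-tier memory: working memory goes into every prompt, archival is looked up on demand."""

    def __init__(self, working_keys):
        self.working_keys = set(working_keys)  # assumed policy: a fixed set of always-needed facts
        self.working = {}    # small and fast, injected into every prompt
        self.archival = {}   # larger and slower, read only when asked for

    def remember(self, key, value):
        """Store a fact in the tier chosen by the policy."""
        target = self.working if key in self.working_keys else self.archival
        target[key] = value

    def prompt_context(self):
        """Text prepended to every prompt; only working memory pays this per-turn cost."""
        return "\n".join(f"{key}: {value}" for key, value in sorted(self.working.items()))

    def recall(self, key):
        """Explicit lookup that falls back to the archival tier."""
        if key in self.working:
            return self.working[key]
        return self.archival.get(key)

memory = AgentMemory(working_keys={"name"})
memory.remember("name", "Kian")
memory.remember("birthday", "January 1")  # placeholder value for illustration

print(memory.prompt_context())      # only the name is sent on every turn
print(memory.recall("birthday"))    # the birthday is fetched only when needed
```

Keeping `prompt_context` small is the optimization the lecture points at: anything in it is paid for on every single turn.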

**** · This type of architecture can be built with different degrees of autonomy, from the least autonomous to the most autonomous, and I'll give you a few examples. Less autonomous would be: you've hardcoded the steps. So let's say I tell the travel agent, first identify the intent, then look up in the database the history of this customer with us and their preferences, then go to the right API, blah blah blah, then go to the... I would hardcode the steps. Okay, that's the least autonomous. The semi-autonomous is: I might hardcode the tools, but I'm not going to hardcode the steps. So I'm going to tell the agent, you act as a travel agent and your task is to help the person book travel, and these are the tools that you have access to. And so I'm not hardcoding the steps. I'm just hardcoding the tools that you have access to. The more autonomous is: the agent decides the steps and can create the tools. So that's where you might give the agent access to a code editor, and the agent might be able to ping any API on the web and perform some web search. It might even be able to create some code to display data to the user.

**** · It might even be able to perform some calculations, like, oh, I'm going to calculate the fastest route to get from San Francisco to New York, and which one might be the most appropriate for what the user is looking for. And then I want to calculate the distance between the airport and this hotel versus that hotel. And I'm going to write code to do that. So it's fully autonomous from that perspective.
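The first two degrees of autonomy can be contrasted in code: in one the developer fixes the steps, in the other only the tool set is fixed and a planner picks the steps. This is a hedged sketch: the tool functions are stubs, and `fake_planner` is a hypothetical stand-in for the LLM deciding which tools to call. The most autonomous level, where the agent writes its own tools, is not shown.

```python
def search_flights(request):
    """Stub tool; a real one would call a flight search API."""
    return f"flights for {request}"

def search_hotels(request):
    """Stub tool; a real one would call a hotel booking API."""
    return f"hotels for {request}"

TOOLS = {"search_flights": search_flights, "search_hotels": search_hotels}

def run_hardcoded(request):
    """Least autonomous: the developer fixes both the tools and their order."""
    return [search_flights(request), search_hotels(request)]

def run_semi_autonomous(request, choose_steps, tools=TOOLS):
    """Semi-autonomous: the tools are fixed, the planner decides which to call and in what order."""
    plan = choose_steps(request, sorted(tools))
    unknown = [name for name in plan if name not in tools]
    if unknown:
        raise ValueError(f"planner asked for tools that do not exist: {unknown}")
    return [tools[name](request) for name in plan]

def fake_planner(request, tool_names):
    """Hypothetical stand-in for an LLM choosing steps from the request."""
    if "hotel only" in request:
        return [name for name in tool_names if "hotel" in name]
    return list(tool_names)

print(run_hardcoded("Paris"))
print(run_semi_autonomous("hotel only, Paris", fake_planner))
```

Validating the plan against the known tool names is one small deterministic guardrail around the fuzzy planning step.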

**** · Okay.

**** · So yeah, remember those keywords, memory, prompts, tools, etc.

**** · Now, I presented the flight API, but it does not have to be an API. You have probably heard the term MCP, or Model Context Protocol, which was coined by Anthropic. I pasted the seminal article on MCP at the bottom of this slide. But let me explain in a nutshell why those things would differ. In the API case, you would teach your LLM to ping an API. So you would say, this is how you ping this API and this is the data that it will send you back, and you would have to do that in a one-off manner. So you would have to build or give the API documentation of your flight API, your hotel booking API, your car rental API, and then you would give tools for your model to communicate with those APIs. It doesn't scale very well, versus MCP. MCP is really about putting a system in the middle that makes it simpler for your LLM to communicate with that endpoint.

**** · So for instance, you might have an MCP server and an MCP client, where you're trying to communicate with that travel database or the flight API, and your agent might just communicate with it and say, hey, what do you need in order to give me more flight information? And that agent will respond with, I would like you to tell me where the origin of the flight is, where the destination is, and what you're looking for at a high level. This is my requirement.

**** · Okay, let me get back to you with my requirements. Oh, you forgot to tell me your budget, whatever. Oh, let me give you my budget, etc. It's agent-to-agent communication, which allows more scalability. You don't need to hardcode everything. Companies have exposed their MCPs out there, and your agent can communicate with them and figure out how to get the data it needs. Does that make sense?
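The back-and-forth described above (ask the server what it needs, send what you have, get told what is missing, retry) can be illustrated with a toy. To be clear about the assumptions: this does not use the real MCP SDK or its message format. `ToyFlightServer`, its `list_tools` and `call_tool` methods, and the `required` field are hypothetical names chosen for this sketch, only loosely inspired by the idea of discovering tools at runtime instead of hardcoding each API.

```python
class ToyFlightServer:
    """Toy tool server (hypothetical); it advertises what it needs instead of relying on hardcoded docs."""

    def list_tools(self):
        """Describe the available tools and their required arguments."""
        return [{
            "name": "search_flights",
            "description": "Find flights between two airports within a budget.",
            "required": ["origin", "destination", "budget"],
        }]

    def call_tool(self, name, arguments):
        """Run a tool, or report which required arguments are missing."""
        tool = next((t for t in self.list_tools() if t["name"] == name), None)
        if tool is None:
            return {"error": f"unknown tool: {name}"}
        missing = [field for field in tool["required"] if field not in arguments]
        if missing:
            return {"error": "missing arguments", "missing": missing}
        return {"result": "flights {origin} -> {destination} under {budget}".format(**arguments)}

def client_session(server, facts):
    """Client side: discover the tool, send a first attempt, then fill in what the server asks for."""
    tool = server.list_tools()[0]  # discovered at runtime, not hardcoded
    arguments = {key: facts[key] for key in ("origin", "destination")}
    first_reply = server.call_tool(tool["name"], arguments)
    for field in first_reply.get("missing", []):
        arguments[field] = facts[field]  # "oh, you forgot to tell me your budget"
    second_reply = server.call_tool(tool["name"], arguments)
    return first_reply, second_reply

facts = {"origin": "SFO", "destination": "CDG", "budget": 900}
first_reply, second_reply = client_session(ToyFlightServer(), facts)
print(first_reply)
print(second_reply)
```

If the server later adds a required field, the client learns about it from the reply rather than from a code change on the client side, which is the scalability argument made in the lecture.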

**** · Yeah.

**** · Oh, sorry. rewriting any help it's suffering only changes in the API rather agent you can rewrite that yes is it not just sh is I think ultimately the question is, isn't it just shifting the issue? Because if an API has to be updated, the MCP has to be updated anyway. So what do you say? Yes, that's correct, but at least it allows the agent to go back and forth and figure out what the requirements are.

**** · But at the end of the day, ideally, if you're a startup, you have some documentation and automatically have an agent or an LLM workflow that reads that documentation and updates the code accordingly, but I agree it's not it's not something that is fully autonomous. Yeah. Yeah.

**** · Why is that?

**** · Which security specifically?

**** · Yeah.

**** · So are there security issues with MCPS?

**** · So think about it this way. MCPS depending on the data that you get access to might have different requirements, lower stake or higher stake. I'm not an expert at the full range. But it wouldn't surprise me that when you when you when you expose an MCP to an I think you would a lot of MCPs have authentication.

**** · So you might need a code or a key to talk to it, just like you would with an API. Yeah, but that's a good question. I'm not an expert on the security of these systems, but we can look into it.

**** · Any other questions on what we've seen with the agentic workflows, APIs, tools, MCPs, memory? All of that is in progress. Even memory is not a solved problem by any means. It's pretty hard to get right. Yes? [Audience:] You don't need MCP to access the API; technically, if you engineer your way to achieving something from the API, you can do the same.

**** · Exactly.

**** · Exactly. Yeah. Is MCP about efficiency or about accessing more data? It's about efficiency. Let's say you have a coding agent, it has an MCP client, and there are multiple MCP servers exposed out there. That agent can communicate very efficiently with them and find what it needs. It's a more efficient process than exposing APIs and documenting, on the other side, how to ping them and what the protocol is. But it's not about the data that is being exposed, because ultimately you control the data that is being exposed. Depending on how the MCP is built, my guess is you probably expose yourself to other risks, because your MCP server can see pretty much any input from another LLM, and so it has to be robust. But yeah.

**** · Super. So let's look at an example of a step-by-step workflow for the travel agent. Step one: the user says, "I want to plan a trip to Paris from December 15th to 20th, with flights, hotels near the Eiffel Tower, and an itinerary of must-visit places." That's the task given to the travel agent. Step two: the agent plans the steps. So it says, "I'm going to find flights: use the flight search API to get options for December 15th. Search hotels, generate recommendations for places to visit, validate preferences, budget, etc. Book the trip with the payment processing API." That's just the planning, by the way.

**** · Step three, execute the plan. Use your tools, combine the results, and then proactive user interaction and booking. It might make a first proposal to the user and ask the user to validate or invalidate and then may repeat that planning and execution process. And then finally, it might update the memory. It might say, "Oh, I just learned through this interaction that the user only likes direct flights. Next time I'll only give direct flights."

**** · Or, "I notice the user is fine with three-star or four-star hotels, and in fact they don't want to go above budget," or something like that.
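The plan, execute, and update-memory loop walked through above can be sketched as follows. This is a toy under stated assumptions: `search_flights` and `search_hotels` are hypothetical stand-ins for real APIs, the plan is hardcoded where a real agent would ask an LLM to produce it, and the memory is a plain dictionary of preferences.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for real tools (flight search API, hotel API).
def search_flights(destination, date):
    return [
        {"airline": "A", "price": 420, "direct": True},
        {"airline": "B", "price": 310, "direct": False},
    ]


def search_hotels(destination, near):
    return [{"name": "Hotel X", "stars": 3, "price_per_night": 150}]


@dataclass
class Memory:
    """Long-lived user preferences learned from past interactions."""
    preferences: dict = field(default_factory=dict)


def plan(request):
    # A real agent would ask an LLM to produce this plan from the request.
    return ["find_flights", "find_hotels"]


def execute(steps, request, memory):
    results = {}
    for step in steps:
        if step == "find_flights":
            flights = search_flights(request["destination"], request["start"])
            if memory.preferences.get("direct_only"):
                flights = [f for f in flights if f["direct"]]
            results["flight"] = min(flights, key=lambda f: f["price"])
        elif step == "find_hotels":
            hotels = search_hotels(request["destination"], request["near"])
            results["hotel"] = hotels[0]
    return results


def update_memory(memory, feedback):
    # "I just learned that the user only likes direct flights."
    if feedback.get("wants_direct_flights"):
        memory.preferences["direct_only"] = True


request = {"destination": "Paris", "start": "December 15",
           "end": "December 20", "near": "Eiffel Tower"}
memory = Memory()

first_proposal = execute(plan(request), request, memory)
update_memory(memory, {"wants_direct_flights": True})
second_proposal = execute(plan(request), request, memory)
```

The first proposal picks the cheapest flight regardless of stops; after the memory update, the second proposal only considers direct flights.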

**** · So that hopefully makes sense by now, how you might do that. My question for you is: how would you know if this works? And if you had such a system running in production, how would you [clears throat] improve it?

**** · Yeah, so that's an example. So let users rate their experience at the end. That would be an end-to-end test.

**** · You're looking at the user experience through the steps and saying, "How good was it, from one to five?", let's say. Yeah, it's a good way. And then if a user says one, how do you improve the workflow?

**** · Okay, so you would go down a tree and say, "Okay, you said one. What was your issue?" And then the user says the prices were too high, let's say, and then you would go back and fix that specific tool or prompt. Yeah.

**** · Okay. Any other ideas?

**** · Yeah, good. So that's a good insight.

**** · Separate the LLM-related stuff from the non-LLM-related stuff, the deterministic stuff. The deterministic stuff you might be able to fix more objectively, essentially. Yeah. What else?

**** · So, give me an example of an objective issue that you can notice and how you would fix it versus a subjective issue.

**** · Yeah.

**** · The flight which is cheaper [inaudible]. Okay, so let's say there's the same flight, but one is cheaper than the other. Picking the more expensive one is objectively worse, and so you can capture that almost automatically. Yeah. So you could build evals that are objective, that are tracked across your users, and you might run an analysis afterward and see that, for the objective stuff, "We noticed that our agentic AI workflow is bad with pricing. It just doesn't read prices well, because it always gives a more expensive option." Yeah, you're perfectly right. How about the subjective stuff?

**** · Yeah.

**** · Do you choose a direct or indirect flight if the indirect is a little bit cheaper?

**** · Yeah, good one. Do you choose a direct flight or an indirect flight if the indirect is cheaper but the direct is more comfortable? Yeah, that's a good one. So how would you capture that information? Let's say this is used by thousands of users.

**** · Could you feed something in about us?

**** · Could you feed something in about the user preferences? Well, you could build a dataset that has some of that information. So you build 10 prompts where the user is saying specifically, "I prefer direct flights because I care about my time," let's say. Then you look at the output, and you give an example of a good output, and you are probably able to capture the performance of your agentic workflow on this specific eval: does it understand the priorities, is it price conscious, is it comfort conscious, essentially. What about the tone? Let's say the LLM right now is not very friendly. How would you notice that, and how would you fix it?

**** · Okay, have a test user run the prompt and see if there's something wrong with that. Tell me about the last step. How would you notice that something is wrong?

**** · So have a couple of LLMs evaluate the response and see if it's satisfactory.

**** · Yeah, I agree with your approach. Have LLM judges that evaluate the response against a certain rubric of what politeness looks like. So in this case you could start with error analysis. You have a thousand users, and you can pull up 20 user interactions and read through them, and you might notice at first sight that the LLM seems to be very rude. It's just super short in its answers and it's not very helpful. You notice that manually with your error analysis. Then you go to the next stage.

**** · You put evals behind it. You say, "I'm going to create a set of LLM judges that are going to look at the user interaction and rate how polite it is, and I'm going to give them a rubric. Then I'm going to swap my LLM. Instead of using GPT-4, I'm going to use Grok.

**** · And instead of using Grok, I'm going to use Llama. Then I'm going to run those three LLMs side by side, give the outputs to my LLM judges, and get my subjective score at the end to say, 'Oh, model X was more polite on average.'"

**** · Yeah, perfectly right. That's an example of an eval that is very specific and allows you to choose between LLMs. You could run the same eval but, instead of varying the LLM, fix the LLM and change the prompt.
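The side-by-side comparison can be sketched like this. Everything here is an assumption for illustration: the model names and canned responses are made up, and `judge_politeness` is a keyword-based placeholder standing in for an LLM judge that would be prompted with a politeness rubric.

```python
from statistics import mean

# Hypothetical outputs collected from two candidate models on the same prompts.
responses = {
    "model_a": ["No.", "Flight is 400."],
    "model_b": [
        "Happy to help! The flight is $400.",
        "Thank you for asking. Unfortunately no.",
    ],
}

POLITE_MARKERS = ("please", "thank", "happy to help", "sorry", "unfortunately")


def judge_politeness(response):
    """Placeholder judge returning a 1-5 score.

    In practice this would prompt an LLM with a rubric and parse its rating;
    the keyword count here only keeps the sketch runnable offline.
    """
    text = response.lower()
    hits = sum(marker in text for marker in POLITE_MARKERS)
    return min(5, 1 + 2 * hits)


def compare(responses_by_model):
    """Average the judge score per model."""
    return {
        model: mean(judge_politeness(r) for r in outputs)
        for model, outputs in responses_by_model.items()
    }


scores = compare(responses)
most_polite = max(scores, key=scores.get)
```

The same harness works for prompt comparisons: keep the model fixed, key the dictionary by prompt variant instead of model name, and compare the averages.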

**** · Instead of saying "act like a travel agent," you say "act like a helpful travel agent," and then you see the influence of that word on your eval with the LLMs as judges. Does that make sense? Okay, super. So let's move forward and do a case study with evals, and then we're almost done for today. Let's say your product manager asks you to build an AI agent for customer support.

**** · Okay, where do you start? Here is an example of the user prompt: "I need to change my shipping address for order blah blah. I moved to a new address." So where do you start if I'm giving you that project?

**** · Yes.

**** · So do some research, see benchmarks and how different models perform at customer support, and then pick a model. That's what you mean. Yeah, it's true, you could do that. What else could you do? Yeah.

**** · Okay.

**** · Yeah, I like that. Try to decompose the different tasks that it will need, and try to guess which ones will be more of a struggle, which ones should be fuzzy, which ones should be deterministic. Yeah? [Audience:] Sit down for a day or two with a customer support agent and see how they do the task.

**** · Yeah, similar to what you said. That's what I would recommend as well. I would sit down with a customer support agent for a day or two, and I would decompose the tasks they're going through.

**** · I would ask them where they struggle and how much time it takes. Yes, that's usually where you want to start: with task decomposition. So let's say we've done that work and we have this list. I'm simplifying, but the human customer support agent would typically extract the info, then look up the database to retrieve the customer record, then check the policy (are we allowed to update the address, or is it a fixed data point?), then draft the response email, and then send the email. Okay, so we've decomposed that task. Once you've decomposed that task, how do you design your agentic workflow?

**** · Yes.

**** · [Audience:] For each step, which method are we going to use, and what are we going to use for resources? Exactly. So to repeat: you're going to look at the decomposition of tasks, get an instinct for what's fuzzy and what's deterministic, and then determine which line is going to be a one-shot LLM call, which one will maybe require RAG, which one will require a tool, which one will require memory. So you will start designing that map. Completely, that's also what I would recommend. You might draft it and say, "Okay, I take the user prompt, and the first step of my task decomposition was to extract information." That seems to be a vanilla LLM. You can guess that the vanilla LLM would probably be good enough at extracting that the user wants to change an address, that this is the order number, and that this is the new address. You probably don't need much technology there other than the LLM. For the next step, it feels like you need a tool, because you're going to have to look up the database and also update the address.

**** · So that might be a tool and you might have to build a custom tool for the LLM to say let me connect you to that database or let me give you access to that resource with an MCP. Yeah.

**** · After that, you probably need an LLM again to draft the email, but you would probably paste in a confirmation that the address has been updated from X to Y, and then the LLM will draft an answer. And of course, just to not forget, you might need a tool to send the email. You might need to post something for the email to go out, and then you'll get the output. Does that make sense? So, exactly what you described. [sighs]

**** · Okay, now moving to the next step. We have decomposed our tasks, and we have designed an agentic workflow around them. It took us five minutes. In practice, it would take you longer if you're building your startup on that. You want to make sure your task decomposition is accurate and your workflow design is accurate. And then you can do a lot of work on every tool to optimize it for latency and cost. But let's say we now want to know if it works, and I'm going to assume that you have LLM traces. LLM traces are very important. If you're interviewing with an AI startup, I would recommend that you ask them in the interview process, "Do you have LLM traces?" Because if they don't have LLM traces, it is pretty hard to debug an LLM system: you don't have visibility into the chain of complex prompts that were called and where the bug is. So it's a basic part of an AI startup's stack to have LLM traces. [laughter]

**** · So let's assume you have traces. How would you know if your system works? I'm going to summarize some of the things I heard earlier. You gave us an example of an end-to-end metric: you look at the user satisfaction at the end. You can also take a component-based approach, where you look at the tool that does the database updates, you manually do an error analysis, and you see, "Oh, the tool always forgets to update the email. It just fails at writing, and I'm going to fix that." This is pretty much deterministic. Or, when it tries to send the email and ping the system that is supposed to send it, it doesn't send it in the right format, and so it breaks at that point. Again, you could fix that. Or, on the draft of the email, the LLM doesn't do a great job: it's not very polite. So you could look component by component, and that is easier to debug than looking at it end to end. You'll probably do a mix of both.

**** · Another way to look at it is what is objective versus what is subjective. An objective example would be that the LLM extracted the wrong order ID. The user said, "My order ID is X," and the LLM, when it looked it up in the database, used the wrong order ID. This is objectively wrong. You can write Python code that checks the alignment between what the user mentioned and what was passed to the database for the lookup.
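A minimal sketch of such an objective check, run over stored traces, might look like this. The trace layout (`user_message`, `lookup_args`) and the order ID format matched by the regular expression are assumptions made for the example; a real system would read whatever its tracing tool records.

```python
import re

# Assumed order ID format: optional letters followed by digits, e.g. "A1234".
ORDER_ID_PATTERN = re.compile(r"order\s*#?\s*([A-Za-z]*\d+)", re.IGNORECASE)


def order_id_matches(user_message, lookup_args):
    """True if the ID sent to the database equals the ID the user wrote."""
    match = ORDER_ID_PATTERN.search(user_message)
    if match is None:
        return False
    return match.group(1) == lookup_args.get("order_id")


# Hypothetical traces: what the user said and what the lookup tool received.
traces = [
    {
        "user_message": "I need to change my shipping address for order #A1234.",
        "lookup_args": {"order_id": "A1234"},
    },
    {
        "user_message": "Please update the address on order 5678.",
        "lookup_args": {"order_id": "5687"},  # transposed digits: a real bug
    },
]

results = [order_id_matches(t["user_message"], t["lookup_args"]) for t in traces]
extraction_accuracy = sum(results) / len(results)
```

Because the check is deterministic, it can run on every trace in production and be tracked as a metric over time.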

**** · You also have subjective stuff which we talked about where you probably want to do either human rating or LLM as judges. It's very relevant for subjective evals.

**** · [snorts] And finally, you will find yourself having quantitative evals and more qualitative evals. Quantitative would be the percentage of successful address updates, or the latency. You could track the latency per component and see which one is the slowest. Let's say sending the email takes 5 seconds, which is too long; you would notice that at the component level or for the full workflow.

**** · And then you will decide where to optimize your latency and how you are going to do that. And then finally, qualitative: you might do some error analysis and look at where the hallucinations are, where the tone mismatches are, whether the users are confused and what they're confused by. That would be more qualitative, and typically it takes more white-glove approaches to do that. Okay, so here's what it could look like. I gave you some examples, but you would build evals to determine, objectively and subjectively, component-based and end-to-end, quantitatively and qualitatively, where your LLM is failing and where it's doing well.
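The quantitative side can be sketched as a small aggregation over traces. The trace fields and the numbers below are invented for illustration; the component names follow the customer support decomposition from earlier in the lecture.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical traces: one entry per handled request, with per-component timings.
traces = [
    {"success": True,
     "latency_s": {"extract": 0.8, "lookup": 0.2, "draft": 1.5, "send_email": 5.0}},
    {"success": True,
     "latency_s": {"extract": 0.7, "lookup": 0.3, "draft": 1.2, "send_email": 4.6}},
    {"success": False,
     "latency_s": {"extract": 0.9, "lookup": 0.2, "draft": 1.4, "send_email": 5.4}},
]


def success_rate(traces):
    """Fraction of requests where the address update succeeded."""
    return sum(t["success"] for t in traces) / len(traces)


def mean_latency_by_component(traces):
    """Average latency in seconds for each workflow component."""
    per_component = defaultdict(list)
    for trace in traces:
        for component, seconds in trace["latency_s"].items():
            per_component[component].append(seconds)
    return {component: mean(values) for component, values in per_component.items()}


rate = success_rate(traces)
latencies = mean_latency_by_component(traces)
slowest = max(latencies, key=latencies.get)
```

Here `slowest` points at the component to optimize first, which in this made-up data is the email sending step.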

**** · Does that give you a sense of the type of stuff you could do to fix or improve that agentic workflow?

**** · Super. Well, that was our case study on evals. We're not going to delve deeper into it, but hopefully it gave you a sense of the type of stuff you can do with LLM judges, with objective, subjective, component-based, end-to-end evals, etc. Last section: multi-agent workflows.

**** · So you might ask, "Hey, why do we need a multi-agent workflow when the workflow already has multiple steps, already calls the LLM multiple times, already gives it tools? Why do we need multiple agents?" So many people are talking about multi-agent systems online. It's not even a new thing, frankly; multi-agent systems have been around for a long time. The main advantage of a multi-agent system is going to be parallelism. Is there something that I wish I could run in parallel, independently, maybe with some syncs in the middle? That's where you want to put a multi-agent system: when the work is parallel. The other advantage that some companies get from multi-agent systems is that an agent can be reused. Let's say a company has an agent that's been built for design. That agent can be used by the marketing team and by the product team, and so now you're optimizing an agent that has multiple stakeholders who can communicate with it and benefit from its performance.
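The parallelism argument can be illustrated with Python's `asyncio`. The agents below are hypothetical and only sleep to simulate waiting on an LLM or tool call; the sketch shows independent agents running concurrently and then syncing on their combined results.

```python
import asyncio
import time


async def run_agent(name, seconds):
    """Stand-in for an agent; the sleep simulates waiting on LLM or tool calls."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def run_in_parallel():
    # gather() runs the three agents concurrently and returns their results
    # in the order the agents were passed in: this is the sync point.
    return await asyncio.gather(
        run_agent("design", 0.1),
        run_agent("marketing", 0.1),
        run_agent("product", 0.1),
    )


start = time.perf_counter()
results = asyncio.run(run_in_parallel())
elapsed = time.perf_counter() - start  # close to 0.1 s rather than 0.3 s
```

Run sequentially, the three calls would take roughly the sum of their durations; run concurrently, the total is roughly the duration of the slowest one.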

**** · I'm going to ask you a question and take a few maybe a minute to think about it. Let's say you were building smart home automation for your apartment or your home. What agents would you want to build? Yeah, write it down and then I'm going to ask you in a minute to share some of the agents that you will build. Also, think about how you would put a hierarchy between these agents or how you would organize them or who should communicate with who. Okay. Okay. Take a minute for that.

**** · Be creative also, because I'm going to ask about all of your agents, and maybe you have an agent that nobody has thought of. Okay, let's get started. Who wants to give me a set of agents that you would want for your smart home? Yes. [Audience:] So the first is a set of agents that track my movements in the house and report information about my house.

**** · Another agent receives that information and adjusts the room temperature, and another [inaudible] usage. Okay, so let me repeat. You have four agents, I think, roughly. One tracks biometrics: where you are in the home, where you're moving, how you're moving, things like that. It knows your location. The second one determines the temperature of the rooms and has the ability to change it. The third one tracks energy efficiency and might give feedback on energy usage, and maybe it has control over the temperature as well, I don't know, or the gas or the water; it might cut your water at some point. And then you have an orchestrator agent. What exactly is the orchestrator doing?

**** · Instructions.

**** · Okay. Passes instructions. So is that the agent that communicates mainly with the user?

**** · Yep.

**** · Okay.

**** · So if I'm coming back home and I'm saying I want the oven to be preheated, I communicate with the orchestrator, and then it would funnel that to another agent. Okay, sounds good. Yeah, so that's an example of, I want to say, a hierarchical multi-agent system. What else? Any other ideas? What would you add to that? Yeah.

**** · [Audience:] The minimal action that you can do. Imagine entering a room, or just logging into a computer, or just opening [inaudible], the minimum action. You have a lot of agents per [clears throat], and then depending on who it is and all the context you have [inaudible]. Oh, I like that, that's a really good one. So let me summarize. You have a security agent that determines if you can enter or not, and when you enter, it understands who you are and gives you a certain set of permissions that might be different depending on whether you're a parent or a kid. You might have access to certain cars and not others, or the kid cannot open the fridge, or I don't know, something like that.

**** · Yeah. Okay, I like that. That's a good one. Yeah. And it does feel like a complex enough workflow that you want a specific agent tied to it. I agree. [snorts] What else?

**** · Yes.

**** · [Audience:] Continuing on the ambient stuff, you can get more complicated. So energy savings with [inaudible] as well. And for the grocery store, understanding what's in your fridge or not [inaudible].

**** · Well, that's really good. So you mentioned two of them. One is maybe an agent that has access to external APIs, that can understand the weather out there, the wind, the sun, and then has control over certain devices at home (temperature, blinds, things like that) and also understands your preferences for them. That does feel like a good use case, because you could give that to the orchestrator, but it might lose itself because it's doing too much. Also, these problems are tied together: the outdoor temperature from the weather API might influence how you want the temperature inside, etc.

**** · And then the second one, which I also like, is that you might have an agent that looks at your fridge and what's inside; it might have access to a camera in the fridge, for example. It knows your preferences and also has access to an e-commerce API to order Amazon groceries ahead of time. I agree. And maybe the orchestrator will be the communication line with the user, but it might communicate with that agent in order to get it done. Yeah, I like those. Those are all really good examples.

**** · Here is the list I had up there: climate control, lighting, security, energy management, entertainment, a notification agent (alerts about system updates, energy saving), and an orchestrator. So you mentioned all of them. We didn't talk about the different interaction patterns, but you do have different ways to organize a multi-agent system: flat, hierarchical. It sounds like this would be hierarchical. I agree. And the reason is UI/UX: I would rather only have to talk to the orchestrator than have to go to a specialized application to do something. It feels like the orchestrator could be responsible for that. So I agree, I would probably go for a hierarchical setup here. But you might also add some connections between other agents, as in a flat system where it's all-to-all. For example, with climate control and energy, if you want to connect those two, you might allow them to speak with each other. When you allow agents to speak with each other, it is an MCP protocol, by the way. So you treat the agent like a tool, exactly like a tool.

**** · Here is how you interact with this agent. Here is what it can tell you. Here is what it needs from you, essentially. Okay, super. And then, without going into the details, there are advantages to multi-agent workflows versus single agents, such as debugging. It's easier to debug a specialized agent than to debug an entire system.
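A hierarchical setup where the orchestrator treats each specialized agent like a tool can be sketched as below. The agents, their keyword lists, and the keyword-based routing are all hypothetical simplifications; in a real system an LLM would choose the agent from the descriptions rather than from keywords.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    """A specialized agent exposed to the orchestrator like a tool."""
    name: str
    description: str  # what the agent can do, as an LLM router would read it
    keywords: tuple   # crude routing signal used only in this sketch
    handle: Callable[[str], str]


def climate_agent(request):
    return "climate: adjusting the temperature"


def lighting_agent(request):
    return "lighting: switching the lights"


def security_agent(request):
    return "security: locking the door"


class Orchestrator:
    """Single point of contact for the user; forwards work to one agent."""

    def __init__(self, agents):
        self.agents = agents

    def route(self, request):
        text = request.lower()
        for agent in self.agents:
            if any(keyword in text for keyword in agent.keywords):
                return agent.handle(request)
        return "orchestrator: no specialized agent found"


home = Orchestrator([
    Agent("climate", "Controls room temperature.",
          ("temperature", "degrees", "thermostat"), climate_agent),
    Agent("lighting", "Controls the lights.", ("light",), lighting_agent),
    Agent("security", "Controls locks and access.", ("lock", "door"), security_agent),
])

reply = home.route("Set the living room to 21 degrees")
```

Because each agent sits behind a small, well-defined interface, it can be tested and debugged on its own, which is the debugging advantage mentioned above.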

**** · Parallelization as well. It's easier to have things run in parallel, and you can save time. So there are some advantages to doing that, and I leave you with this slide if you want to go deeper. Super. So we've learned so many techniques to optimize LLMs, from prompts to chains to fine-tuning to retrieval and to multi-agent systems as well. And then, just to end, a couple of trends I want you to watch. I think next week is Thanksgiving. Is that it? Is it Thanksgiving break? No, the week after. Okay. Well, ahead of the Thanksgiving break, so if you're traveling, you can think about these things. On what's next in AI, I wanted to call out a couple of trends.

**** · So Ilya Sutskever, one of the OGs of LLMs and an OpenAI co-founder, raised the question of whether we are plateauing or not: in the coming years, are we going to see LLMs not improve as fast as we've seen in the past? The feeling in the community has probably been that the last version of GPT did not bring the level of performance that people were expecting, although it did make it so much easier to use for consumers, because you don't need to interact with different models; it's all under the same hood. So it seems that it's progressing, but whether there is a plateau is unclear. The way I would think about it is that the LLM scaling laws tell us that if we continue to increase compute and energy, then LLMs should continue to improve, but at some point it's going to plateau. So what's going to take us to the next step? It's probably architecture search. A lot of LLMs, even if we can't see what's under the hood, are probably still transformer-based today, but we know that the human brain does not operate the same way. There are certain things that we do that are much more efficient and much faster, and we don't need as much data. So theoretically, there is so much in terms of architecture search that we haven't figured out.

**** · It's not a surprise that you see those labs hire so many engineers, because it is possible that in the next few years you're going to have thousands of engineers trying to figure out the different engineering hacks, tactics, and architectural searches that are going to lead to better models, and one of them will suddenly find the next transformer, and it will reduce the need for compute and energy by 10x. If you've read Isaac Asimov's Foundation series, individuals can have an amazing impact on the future because of their decisions. Whoever discovered transformers had a tremendous impact on the direction of AI. I think we're going to see more of that in the coming years, where some group of researchers that is iterating fast might discover certain things that suddenly unlock that plateau and take us to the next step, and models are going to continue to improve. And so it doesn't surprise me that there are so many companies hiring engineers now to figure out those hacks and those techniques. The other set of gains that we might see is from multimodality.

**** · So the way to think about it is that we first had text-based LLMs, and then we added imaging, and today models are very good at images.

**** · They're very good at text. It turns out that being good at images and being good at text makes the whole model better. So the fact that you're good at understanding a cat image makes you better at text about a cat as well. Now you add another modality, audio or video, and the whole system gets better. So you're better at writing about a cat if you know what a cat sounds like and if you can look at a cat in an image as well.

**** · Does that make sense? So we see gains that transfer from one modality to another. And that might culminate in robotics, where all these modalities come together, and suddenly the robot is better at running away from a cat because it understands what a cat is, how it sounds, what it looks like, etc. That make sense?

**** · The other one is multiple methods working in harmony. In the Tuesday lectures, we've seen supervised learning, unsupervised learning, self-supervised learning, reinforcement learning, prompt engineering, RAG, etc. If you look at how babies learn, it is probably a mix of those different approaches. A baby might have some meta-learning, meaning it has some survival instinct that is most likely encoded in its DNA, and that's the baby's pre-training, if you will. On top of that, the mom or the dad is pointing at stuff and saying "bad, good, bad, good": supervised learning. On top of that, the baby's falling on the ground and getting hurt, and that's a reward signal for reinforcement learning. On top of that, the baby's observing other people or other babies doing stuff: unsupervised learning. You see what I mean? We're probably a mix of all these methods, and I think that's where the trend is going: those methods that you've seen in CS230 come together in order to build an AI system that learns fast, is low latency, is cheap and energy efficient, and makes the most out of all of these methods.

**** · Finally, and this is especially true at Stanford, you have research going on that you would consider human-centric and some research that is non-human-centric. By human-centric, I should say approaches that are modeled after the brain, versus approaches that are not modeled after humans, because it turns out that the human body is very limiting. So if you only do research on what the human brain looks like, you're probably missing out on compute and energy and other things that you can optimize even beyond neuronal connections in the brain. But you can still learn a lot from the human brain. And that's why there are professors running labs now that try to understand how backpropagation works for humans. In fact, it's probable that we don't have backpropagation, that we don't use it, and that we only do forward propagation, let's say. This type of stuff is interesting research that I would encourage you to read if you're curious about the direction of AI.

**** · And then finally, one thing that's going to be pretty clear, and I call it out all the time, is the velocity at which things are moving. Part of the reason we're giving you breadth in CS230 is that these methods are changing so fast. I don't want to bother teaching you method number 17 for optimizing RAG, because in two years you're not going to need it. I would rather you think about the breadth of things you want to understand, so that when you need something, you can sprint and learn the exact thing you need faster. Because the half-life of skills is so short, you want to come out of the class with good breadth and then have the ability to go deep whenever you need to after the class, and that's how the class is designed as well. Yeah, that's it for today. So thank you for participating.