Transcript

Hook

0:00 · When people think about the ability of an AI to run your app and test it, I think they actually overindex on the computer use part of it, because computer use in my mind is the literal, okay, you know, there's a button you want to click; can you emit the right coordinates to go click that button? I think testing is actually a really interesting problem-solving, uh, challenge for these AIs, because if you wanted to do arbitrary testing, like, imagine you make a change that spans the front end and the back end. To actually test that change, we have to reason through: how do you first run these applications to orchestrate with each other with the right version of the code? Then, okay, how do I trigger the feature, or how do I make the thing actually happen? That is where we spend most of our time. Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content. We've been approached by sponsors on an almost daily basis.

1:01 · But fortunately, enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way. But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button.

1:15 · It's the only thing I'll ever ask of you. And it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week. If you do it, I promise you we'll never stop working to make the show even better. Now, let's get into it.

Introduction

1:33 · [music] All right, we're in the studio with Walden Yan, co-founder, CPO.

1:38 · Yeah.

1:39 · Which is a cool title. Um, yes. And one coiner of context engineering. Yes. Yes. Although I think there were many people who used the terms in various ways beforehand. But, um, I did find that people both internally and externally enjoyed the upgrade from prompt engineering or, you know, model wrapping into maybe a more thoughtful way to build agents.

2:03 · Yeah.

2:03 · For, uh, those who haven't caught up on that, I have on screen the "Don't Build Multi-Agents" post, which you should read and which we might refer to. And Cole Murray, who created Open Inspect.

Why Everyone Is Building Their Own Devin

2:13 · Great to be here. Okay, so let's talk about it. Everyone is building their own Devins. Um, what's going on? [laughter] Yeah, so I think the engineering world is kind of waking up to this idea of background agents, cloud agents, uh, whatever you'd like to call it. And I think we saw a shift around the December time frame of 2025 where the models, Opus 4.5 and GPT 5.2, reached a capability where we moved away from kind of handholding the model to being able to actually more or less autonomously drive the model. And what I mean by that is that we could pretty much go from a specification to a completed pull request, assuming the spec was good enough, uh, with very little friction. And that paradigm alone, I think, changed a lot of how we interact with agents, um, and kind of opened this world where background agents became more practical.

3:11 · I think, for Cole, everyone experienced this in December, but I feel like there was just this increasing ramp, right? Like, um, there was the moment, which was I think Sonnet 3.7, where, like, you guys rewrote Devin in one night or something.

3:26 · Yes. Yeah. Yeah. So describe 2025 or you know how how it felt from your side.

Devin’s 2025 Ramp: 7x PR Growth and 80% of Commits

3:31 · In retrospect, you know, we always thought it was ramping up, but then even now, over the last three, four months from today, it's been ramping up even faster. So it's almost funny to be talking about how big of a leap Sonnet 3.7 was, and honestly a lot of it was stripping out parts of Devin that were no longer needed with that jump in intelligence.

3:52 · But I also just think that a lot of the recent leaps, uh, especially, you know, you look at models like Opus and the latest GPT models, they are reaching levels of autonomy where people are actually finding that they actually can just be hands off. And for people who were once debating, oh, you know, do I need to be in the weeds with my model in the IDE, um, can I just completely move it off into the cloud, that's a more serious conversation. And we've seen that in all of our growth charts. Um, internally there's this funny graph where our usage of PRs, or our merged PRs, has grown 7x since, I forget what. I think Dave, uh, maybe tweeted that.

4:30 · Yeah. Uh yes.

4:31 · Um, it grew like 7x over like the last, I think it was like two months, three months, something like that. Uh, and then you see our engineering headcount growth. It's like gone up by like 10% or something. Like, we were afraid to release this. So this is Devin commit percentages on all Devin repos: uh, it was 16% in January and now 80% in March.

4:52 · Yeah, [laughter] it's like, uh, it's a big shift right now. Um, and so it makes sense that a lot of people are now thinking about, you know, buying Devin, but also maybe, like, you know, trying to build their own. And I have a lot of fun building Devin, so I can see why other people would want to build their own cloud agents as well. Yeah.

5:13 · Well, maybe it's good to hear what initially inspired you to try to build Open Inspect.

OpenInspect and the Rise of Open-Source Background Agents

5:19 · Yeah, Open Inspect came about, uh, primarily through my clients, observing how they were using tools like Claude web, um, OpenAI's Codex at the time, and seeing some of the friction that they were having with it. Um, primarily Claude web, uh, was being used through Slack, and a big issue they ran into was that the sessions that were launched were specific to whoever called it via Slack. And so if a PM was the one who invoked the session and they would then go to pass context to engineering, engineering can't see the session. And that in itself was kind of a dealbreaker, because the PM says, hey, engineering, can you jump in, but there's nothing to jump in on unless they're copy-pasting out, or, you know, the single response that came back. Um, and so, kind of seeing some of these problems, I had built a similar kind of architecture internally, um, just to experiment with, um, and kind of test out different ideas as this trend of moving off of localhost was starting to kind of become... um, and as Ramp released their blog post, I had a lot of the pieces for this already in place. Um, and I just thought it would be kind of funny to, uh, see what Claude could do just purely from the blog post, and, uh, on my X account there's actually kind of a thread where I live-tweeted going through this.

6:43 · Oh wow.

6:44 · Uh comparing GPT and Claude as both of them are going through it like on the announcer thing or something else.

6:50 · Uh right after it got released. Okay. Um we can put it in the show notes.

6:54 · Um, yeah, it was helpful that I already kind of knew how to verify the system. I knew what I was looking for. I think Ramp did a great job of really illustrating, uh, the technical aspects of how to build something. Um, it was much more than just kind of like, hey, we built a great system. It was, and here's how you can build it too. And so, um, I resonated a lot with that, just with the problems that I was already seeing. And, uh, looking around, I didn't really see anything in the open source community that, um, kind of met this type of system. I think there's a lot that run, uh, on localhost, like Superset, Conductor, um, and many others, but nothing that was actually running in the cloud. And so, um, I built it, and I thought it was interesting to just open source it and allow anyone to then have a foundation that they can mix and match on top of.

7:46 · So literally after Devin was launched, there was OpenDevin, which became All Hands. Uh, I don't know if you tried that or... Yeah. Well, I was going to say, one of the things that interested me a lot with Open Inspect was, like, you didn't try to go make it then something you monetize. Um, I think a lot of these open source projects would then go really try to, like, raise VC.

8:07 · [laughter] Um yeah and uh how did you think about that? I I thought that was very interesting. I thought and kind of just what I had seen across my clients was that having a background agent system is going to become a critical infrastructure within their company. And so because of that, I think that I wanted to open source it so that they could fork it and put in whatever customization they wanted. Um to that question though, I get asked all like, "Oh, are you going to raise are you going to turn this into a service?"

8:38 · I'm sure you've gotten offers. Uh but uh primarily I don't want to do that for a few reasons. One, I think that I don't want to compete for like $20 a seat. I think that that is just a really difficult business. I think it's very easy to copy the main pieces of it. I mean, again, like I built this fairly quickly and I think because you are not owning I guess the entire stack, it's hard to monetize. You have money being made at the sandbox layer with Daytona, E2B, many other players. You have money being made at the model layer. Um, and you kind of sit in this weird in between gray area where what are you actually selling? You're selling I guess the infrastructure. You're selling uh the integrations maybe. Um, let's ask the guy what are you what are you selling?

9:27 · [laughter] Well, yeah, there's multiple layers to this in practice. And actually, it's funny you mentioned the infrastructure, because when we got started building Devin as well, uh, we had to go figure out how to make the infrastructure as well. Because you had to build this two years before everyone else, you know.

What Cognition Actually Sells Beyond Devin

9:43 · Yeah, exactly. And [snorts] including, like, the side, it was not very polished at the start. Like, when we just built it off of raw VMs from cloud providers, like EC2, the bootup time was so slow, I think. Uh, and especially then, like, turning off the machines, saving them, and then being able to bring them back up again when you want Devin to wake up again later. Uh, it would just be out cold for like 10 minutes because that's just how long these systems took.

10:10 · They were not built for this repeated down and up usage. And so we actually had to go do all of that. Um, and as a result, now one thing we offer when we go and sell Devin to people is, you know, you don't have to worry about all the compute side of things. We'll make it work. We'll make it work in your cloud if you want to. Um, but aside from the product, and I want to go into the agents and the tuning of the intelligence part later, I think a big part of what we do at Cognition as well is to just make sure that your company learns and uses and adopts these coding agents. Because I think, especially for the largest enterprises in the world, you find that there are a lot of people who want to move over to using AI for their day-to-day workloads. Um, but because of the way projects are planned, because, um, not everyone is literate in using AI in these ways, uh, having a team of engineers who can actually go in and onboard you, set up all the integrations you need, the automations you need to really get to that level of, you know, leverage with AI is super helpful. And so, totally, we do that. We show up as thought partners to the customers that we work with as well. So let's talk about, like, architectural stuff. I think that's always, uh, something that was the topic of conversation between the two of you.

Background Agent Architecture: Harness In vs Out of the Box

11:34 · Is this uh sort of like the mental model that you want to start with or something else? I I you know I'll just kind of leave the floor open to you guys.

11:41 · Yeah, I think that maybe we can start here with just kind of a general: what are the pieces of a background agent system? And then maybe we can go into some of the nuances of, uh, decisions that you can make. But I guess, like, also, maybe what Walden is saying is the agent is kind of like in this OpenCode box, I guess, right? Like, this is infra, and then that's the agent, and you had this discussion about whether you put the agent in here or externally. Can you sort of tease that out? Yeah, in a background agent system you have a decision to make of where the agent is actually going to run. This is typically described as the harness in the box or out of the box.

12:20 · Yeah.

12:20 · with running the agent in the box.

12:23 · Uh, you're making some trade-offs by doing that. The negative trade-off you're making is primarily security, because the agent is running in that box. Unless you otherwise design it, all of your secrets need to go into that box as well. And given the nature of AI, it can be unpredictable, and you could very easily end up accidentally exfiltrating your secrets, um, or, you know, other kinds of unintended behavior.

12:50 · Now, out of the box is the idea that we are going to have the actual agent running not directly in the sandbox, and we'll have, quote unquote, the brain of the agent running in some type of worker, uh, control plane. That sandbox then is going to serve as the hands, where the brain is basically operating and making tool calls into that environment to manipulate it. I guess the other trade-off that you're making between the two systems is that, um, in my opinion, running it out of the box is much more complex, because, uh, you have state that has to be managed, whereas if you're running it in the box, uh, all of the state of that agent is actually in the box, and yes, you could persist it elsewhere, but it's all kind of localized and you have fewer concerns to worry about. I think a lot of what you mentioned is why we actually, from the start, built Devin to, what we called, separate the brain from the machine.
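A minimal sketch of the out-of-the-box split described above, under stated assumptions: the `Sandbox` and `Brain` classes, the two tool names, and the fixed list of planned calls are all hypothetical and are not taken from Devin or Open Inspect. It only illustrates that the brain keeps its own secrets and state in the control plane while the sandbox holds narrowly scoped secrets and executes tool calls.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """The 'hands': holds only narrowly scoped secrets and executes tool calls."""
    env: dict = field(default_factory=dict)
    files: dict = field(default_factory=dict)

    def call_tool(self, name, **args):
        # Hypothetical tool set; a real sandbox would expose shell, editor, browser, etc.
        if name == "write_file":
            self.files[args["path"]] = args["content"]
            return "ok"
        if name == "read_file":
            return self.files.get(args["path"], "")
        raise ValueError(f"unknown tool: {name}")


@dataclass
class Brain:
    """The 'brain': runs in the control plane and owns its secrets and session state."""
    model_api_key: str
    history: list = field(default_factory=list)

    def run(self, sandbox, planned_calls):
        # A real brain would ask a model for each next call; here the plan is fixed.
        for name, args in planned_calls:
            result = sandbox.call_tool(name, **args)
            # State lives with the brain, which is the extra complexity mentioned above.
            self.history.append((name, args, result))
        return self.history
```

In this shape, the model key never enters `Sandbox.env`, so anything running in the sandbox can only reach the scoped secrets that were deliberately placed there.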

Separating the Brain from the Machine

13:46 · Uh, the other thing that this allows you to do is reuse any existing infrastructure you have for dev boxes, perhaps, and so you don't have to worry as much about making a new type of dev box that has all the dependencies the brain needs and, as you mentioned, the secrets the brain needs as well. Um, like, one thing that we've seen some customers run into is, like, you have a GitHub app and you want Devin, your agent, whatever, to be able to interact with GitHub through this application, but then you have different users with different actual permissions.

14:20 · If they are all interacting through the same GitHub app and there's no actual, like, separation between the system that decides, uh, what it does and the actual secrets on the machine, then you kind of run into an issue where, okay, it's hard to do the separation. But in practice with Devin it's much easier, because we just say whatever you put on the machine, that is kind of the scope of basically what the user is free to do, what the agent is free to do. So only put the most scoped secrets on that machine, and then the brain is fully not accessible from the machine. So you don't have to worry about messing with any of the most secure parts of the brain if the user is free to do whatever they want with the machine. I was going to just bring... I have this chart from OpenAI, uh, where, I don't know if in the box, out of the box is something that they do use to describe it. Uh, and then also recently Anthropic did, like, managed agents, which is their thing.

15:16 · I don't know it's all it's all variations of the same pattern, right?

15:19 · Yeah. So this would be out of the box.

15:21 · Yeah.

15:22 · Which like is is preferable for them because it's less work.

15:26 · Uh I would say it's more work but it in my opinion it is the better architecture of the two. Okay.

15:32 · It's just uh you're taking on a bit of complexity by doing that. One thing I've not seen a lot of other players do well is how do you manage what's actually on the box? And this can be complex for many reasons. Like let's say you have a big repository that's changing and updating a lot with changing dependencies. How do you make sure that the working environment of the agent actually stays up to date, has all the credentials it needs to, let's say, run the app and test it, all the things you want your autonomous repo set up.

Repo Setup, Secrets, Docker, and Full VMs

16:05 · Yeah, exactly. So internally at Cognition, uh, we call this repo setup.

16:09 · The hardest part of it... it's been a perennial problem since the start of the company: how do we help people get this set up? Because not everyone just has, you know, uh, working cloud environments working out of the box. And do you find this to be a common problem with your clients?

16:24 · Yeah, this is a very common problem, and, uh, through my consulting this is a lot of what I help teams do. Um, a lot of teams don't really have great developer environment setups, if any. Um, a lot of the time it's go talk to Bob and get the secrets, and that obviously doesn't work when the agent needs to actually set this up. And so a lot of that, uh, most teams are using Docker Compose or some type of microservices. And so in prod, not in prod? Um, like, with Open Inspect you are using this primarily to interact, um, and make code changes. There are other use cases, but, um, whether through CLIs, MCPs, other tools, um, you can then hook that into your production systems, primarily for, like, SRE-type use cases, but you are not, uh, necessarily, like, trying to test your prod internal microservices through the system. Yeah, and you mentioned Docker Compose. I think one direction we saw some of our friends take early on was, um, using Docker containers as a level of abstraction for their models. Um, there's lots of reasons, I think, why Docker containers are not great. Um, one thing is, like, Docker containers are not really a true security boundary, for one. Uh, but the other is, like, if you are running real applications, a lot of times those applications use Docker, and then you have to think about Docker-in-Docker, which is, like, really weird.

17:47 · Yes.

17:47 · And so I think part of like the really hard challenge of getting VMs to work, why did we do that? Well, it was because we realized that you actually needed like full VMs to be able to do these types of things. And especially nowadays where there's actually value in running the application and clicking around and sending you screen recordings of these things, like the value just kind of like keeps adding on on top of that. Um, but it is a decision I I see people run into when they try to build their own systems is oh like do we like in addition to this do we put the agent in the machine or out of the machine? Do we use Docker? Do we use something else?

18:24 · What what do you recommend people nowadays?

18:27 · I think Docker is a good solution for maybe not running the agent but running your infrastructure because that is more or less the same setup your engineers are probably already using. Um, if they're not, then I don't know what they're using, but [laughter] uh they're probably already using Docker Compose.

18:44 · Yeah, I've always held a small candle for WebContainers. I don't know if you guys have tried them before. To me, they were, like, supposed to be like Docker Lite.

18:52 · No, I haven't tried it.

18:54 · Um, but yeah, I think any environment that you've set up that is a good experience for your developers naturally lends itself to being easy to set up for the agent. And once you figure out kind of that local developer story, um, you've kind of more or less solved the agent-in-a-sandbox, uh, environment setup. Open Inspect does have hooks as well, where you can, uh, run a setup.sh script that will pre-install everything. You can then pre-snapshot that build so it starts instantly, and then there is a second hook to actually then, like, restore the state of the sandbox when it comes back.

19:30 · And so you can already have all of those microservices running and basically get the same experience that you would on your machine within the sandbox.
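The setup, snapshot, and restore flow described above can be sketched roughly as follows. This is an illustration under assumptions, not Open Inspect's actual hook API: `SandboxImage`, `setup_hook`, `restore_hook`, and the package and service names are all hypothetical stand-ins.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SandboxImage:
    installed: list = field(default_factory=list)  # dependencies baked into the image
    running: list = field(default_factory=list)    # services currently running


def setup_hook(image):
    # Stand-in for a setup script: slow, one-time dependency installation.
    image.installed += ["node_modules", "python-venv"]


def restore_hook(image):
    # Stand-in for the second hook: bring services back after a snapshot restore.
    image.running = ["postgres", "api", "web"]


def build_snapshot():
    # Pay the installation cost once, then freeze the result.
    image = SandboxImage()
    setup_hook(image)
    return copy.deepcopy(image)


def start_session(snapshot):
    # Every session starts from its own copy of the snapshot, so startup skips setup.
    session = copy.deepcopy(snapshot)
    restore_hook(session)
    return session
```

The design point is that the expensive step runs once per snapshot, while the cheap restore step runs once per session.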

19:38 · Another thing that we've been thinking a lot about is kind of like different VM service offerings. Have you had customers where they needed, like, macOS-specific VMs or, like, Windows-specific VMs?

19:51 · Not yet.

19:52 · There are, like, many technologies in the world that only work on specific types of machines, right? If you're building an application that has to run on Windows, or, like, you know, maybe more commonly, if you want to build for iOS or macOS. Supporting choices like that, the fundamental architecture we have, because we do the separation, does support it, but the actual work is in progress right now on these. Another thing that we've actually recently added support for, it's in beta, is doing Android development. To do that, we needed to support, I think, nested virtualization within our machines, because the VM itself is a virtualized Firecracker instance, and then you have to run another Android emulator inside. Um, and, you know, there's, like, weird performance issues, which is why it's still in beta. We have to think through these problems, but it unlocks a lot for anyone who wants to do Android development.

Why Testing Is Harder Than Computer Use

20:43 · I was trying to find, like, a reference video for the testing thing. I couldn't find it, but, uh, I think you worked on the testing, uh, capability. Why do you call it testing and not, like, computer use, or, I don't know, what's the general category of problem?

20:56 · I think that when people think about the ability of an AI to run your app and test it, I think they actually overindex on the computer use part of it because computer use in my mind is the literal okay you want you know a button you want to click can you emit the right coordinates to go click that button. I think testing is actually a really interesting like problem solving. Yeah.

21:19 · Uh, challenge for these AIs, because if you wanted to do arbitrary testing, like, imagine you make a change that spans the front end and the back end, um, maybe, you know, even some other, like, even more deeply nested service. To actually test that change, we have to reason through: how do you first run these applications to orchestrate with each other with the right version of the code? Then, okay, how do I trigger the feature, or how do I make the thing actually happen? Um, and this can kind of get arbitrarily hard. Like, maybe you have to be an admin, maybe a certain thing has to be feature-flagged on, maybe, uh, you have to, like, run two sessions and then send a very specific word into one of them to trigger specific behavior. And figuring out how you do that requires a lot of codebase context, requires, uh, a lot of orchestration that we've specifically done, and in some cases we found that actually no one frontier model can do this full end-to-end task itself. We've seen cases where we actually had to orchestrate different frontier models together to kind of solve this problem together.

22:18 · That is where we spend most of our time when we think about this testing problem. Not so much the computer use part. Computer use, for what it's worth, has gotten a lot better with recent models, and it's made that part of the job certainly easier. Yeah, especially with, like, even 4.7, um, that they released yesterday, apparently, like, way better in terms of the vision stuff, which is going to be encompassing computer use. Having evals for all these as well is something that, like, takes a while to build up. Um, and having the eval be right is tricky as well. Do you ever see, like, you know, clients who are building their own agents have to start standing up evals to make sure things don't regress? Not so much evals in the traditional sense, but, um, specific to the testing part, that has just gone in. Um, I just added support for screenshots, and, um, in theory you can also do video. I need to put in a plugin to do that. But, uh, they do kind of show up natively, and it was a very heavily requested feature, um, especially after Cursor's recording came out. Um, I think that was very enlightening for everyone, like, oh, this is a very good feature to actually have. Um, I think with Devin you guys have had this for a while.

23:24 · Yeah, first. Yeah. [laughter] Oh, yeah. I see how screenshots work. Um, yeah. I don't know if there's anything, uh, super non-obvious. It's kind of like, once you know what feature to build, you can just kind of prompt it and it mostly works.

23:40 · Um, I think to Walden's point though, the computer use is kind of a subset of the larger testing problem. And I think that that's very specific to the codebase that you're working in. It's not something that uh you know out of the box that you could just solve it.

23:56 · The you do need the codebase context to actually know how to test it. And I think in the case of a background agent system, you fortunately do have that codebase locally that you know what is changing and could then inspect it and use that to drive the model.

Video Verification and the “I Know It Works” Merge Moment

24:10 · Yeah.

24:10 · Yeah. Uh for those who haven't seen it before, this is an example of how it works. Like you uh after the PR is done, you click testing approved and then it sends you back a video. Uh what I really like is that it labels um it's very small here, but it actually labels what it's testing. Um and then it and then you actually see the the cursor and everything.

24:31 · So I don't know like uh yeah the engineering in this like just whatever whatever you want to show because this is like you know this is one of those like oh feel the AGI moments right like cuz once I look at this I actually don't I I wish I can just merge inside of Slack instead of going to GitHub because I don't need to see the code. I know it works. May maybe a new feature coming.

GitHub UX, Devin Review, and AI Code Review

24:50 · [laughter] Yeah. Um the annotations at the bottom was also a big difference for me when I when I added those.

24:57 · Yeah.

24:57 · It's just like, what am I looking at? What are you trying to demonstrate exactly? Um, there's a surprisingly long tail of small details that ends up making a big difference for this kind of end metric of, like, how fast do you actually merge the code in. One experience that we spent a lot of time tuning early on was what is the right experience on GitHub for these tools.

25:20 · Sure.

25:20 · Because I think, um, most tools out there, when you build the agent, you'll think about, oh, like, it will create the PR for you. We tried to take that a step further and say, oh, like, what if we actually made sure you could interact with Devin directly on GitHub.

25:34 · And so we made sure that you could comment on GitHub and Devin would actually receive those comments and address them back. But there's actually quite a bit of tuning you have to do here, because you can imagine that, actually, like, we recently have Devin Review, for example. Um, Devin Review will post comments on its own PR, and then Devin has to then go answer its own comments, which is really, really loopy. Uh, so, like, yeah, I like that it just updates here that I have commented, but, uh, usually it's just me saying, like, hey, merge, uh, fix any merge conflicts. [laughter] Yeah, so when Devin fixes its own comments, you might be scared that, oh, maybe it'll infinite loop, but we put a lot of work into making sure it doesn't, um, both by making sure that the comments are high signal, but also that the agent is thoughtful about what comments it goes and tries to fix and what comments it's like, wait a second, I think you're wrong.

26:25 · Actually, that's one of my favorite moments, when Devin tells me that I'm wrong when I try to get it to do something different. Yeah.

26:30 · But tuning that behavior like actually makes a big difference in terms of how useful the actual GitHub experience is.
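One way to picture the loop-avoidance tuning described above is a triage step that only acts on high-signal comments and pushes back on the rest. This is a rough sketch, not how Devin Review works: the `Comment` type, the `confidence` score, the thresholds, and the hard cap on review rounds are all assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    author: str
    body: str
    confidence: float  # hypothetical signal score in [0, 1]


def triage(comments, rounds_done, max_rounds=3, min_confidence=0.7):
    """Split review comments into ones to fix and ones to push back on."""
    if rounds_done >= max_rounds:
        # Hard stop (an assumption): after enough rounds, stop editing and only reply.
        return {"fix": [], "push_back": list(comments)}
    fix = [c for c in comments if c.confidence >= min_confidence]
    push_back = [c for c in comments if c.confidence < min_confidence]
    return {"fix": fix, "push_back": push_back}
```

For example, a likely null dereference scored 0.9 would land in `fix`, while a stylistic rename scored 0.2 would land in `push_back` and get a reply instead of a code change.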

26:35 · Yeah.

26:35 · Yeah. I think to touch on that as well, I think having the AI reviewer integrated into the system is a critical part of this background system. Um, Open Inspect does have that. It has a GitHub code reviewer that you can control the prompt. Um, it does do comments as well. It doesn't do them automatically yet. Um the capability is there but it's not fully so you have to ask for it.

26:58 · Uh you do yeah you can tag it on GitHub and then whatever you named your uh GitHub bot it will then follow up on it. It will then uh if you have merge conflicts or whatever you have asked it to resolve it will then resolve it but it doesn't do it automatically yet.

MCP, Slack, and Enterprise Agent Integrations

27:12 · Well, I'm curious, what is, like, the most common thing that people end up requesting, uh, that they still need on top of Open Inspect when you help them go implement it? I think a lot of it comes down to actually integrating it into the company. It's one thing to kind of have the background agent system set up, but if it isn't actually integrated into your larger ecosystem, it isn't that useful. I mean, it is useful to be able to kick off sessions, but what we really want to be able to do is hook it into all of our other systems, whether that is the production database with read-only credentials, the logs, um, Confluence or an internal knowledge-base system. I think that is where I see the huge leap for companies, and, um, that can be a challenge for companies as well who are maybe not familiar with exactly how to approach it. Um, especially if they're in environments that have more compliance-type things, where, um, access control can be pretty big. And how do you deliberately think about these problems, um, I find to be, uh, kind of one of the problems that comes with a system like this. Yeah, the thing we've found is, so, like, MCPs, um, obviously there's been, like, a really big explosion of, oh, you can go, like, integrate it with all these different things. Um, but to actually get the integration right and get the right experience, oftentimes we found that we had to go build our own ad hoc things. Um, I think Slack is a great example of this. Like, you could give your agent a Slack MCP and, okay, it can post messages back to you on Slack. Um, but we actually use Devin like a coworker in Slack, and that's how it's been built from the ground up. But to do that, you actually need to, like, support webhooks that come back, right? And then Devin has to respond in a natural way and then hopefully not spam your threads too much and annoy the people in your company. So you've got to tune that experience just right.
Um, especially when there's a lot of back and forth, like, we find that we actually had to go beyond the simple MCP integrations in these places.

29:09 · I just pulled up the MCP marketplace. I know this is a fair amount of work. Um, I mean, is the answer to eventually take first-party control of all the top MCPs? Like, is that the... I would love a world where you could have something that's more expressive than MCP, that kind of, like, goes both ways, like, not just a set of tools but, like, a proper system that interacts back and lets it have the right experience with all these interfaces.

29:33 · So there actually is sampling in the MCP spec, but nobody uses it, right? And so I think that's the other part: like, actually, we found that when the MCP spec starts to get too complicated, um, it starts to lose its original promise of being, like, a simple one-step connect. Like, now we had to go figure out how to support all these different variations of things, and it starts to look a lot like just building first-party integrations in a lot of these cases now.

29:59 · Yeah.

29:59 · I think it matters too how critical it is to your company, right?

30:03 · If this is something that nearly every session is going through, it probably makes sense to own it so that you can make optimizations on top of it. Yeah.

30:11 · Versus just whatever is off the shelf.

30:14 · Yeah.

30:14 · Awesome. Other than MCPs, what else? Sorry, I don't know if that's narrowing in too much on integrations, but what other elements of building Open Inspect or Devin do you guys really think about?

Memory, Knowledge, and Always-On Agents

30:29 · Yeah, I think uh a problem that comes up very frequently is this idea of memories or knowledge base.

30:35 · Oh boy. [laughter] Yes. How do you solve it?

30:38 · So, not solved yet is the short answer. There's an open issue for it, someone asking about it. Okay, DeepWiki hasn't indexed anything about memory yet.

30:50 · How I'm seeing it solved across my clients is primarily through skills. I find that skills can fill a good part of that gap, or updating CLAUDE.md, but I think memory as a whole is a pretty unsolved problem, and that is why I've been hesitant to add it. I think there are parts of memory that can be addressed, but as a whole it's a very difficult retrieval problem.

31:14 · Oh my god, Ramp didn't write anything about memory. I see zero search results. No, memory can be quite tricky to get right, because it's not only the retrieval but also the generation of the memories that can be really tricky. You don't want a memory to be too specific.

31:29 · Walk us through the Devin memory journey [laughter], because I know there's been a journey.

31:33 · The first version of memory that stuck around for a while was a system we have called Knowledge, and the idea was that we wanted it to pick up things over time and not need the user to be proactive about teaching Devin things. So anytime you remind Devin, wait, no, that's not quite the way you're supposed to use git, we actually want Devin to say, hey, do you want me to just remember this for the future? And for you to quickly approve or reject, and for it to build up over time. I find that something like 95%, some crazy number like that, of the memories Devin has all come through these auto-generated suggestions. Very few people actually want to sit down and write big docs on, okay, here's how you're supposed to work with the technology, etc. The generation and the retrieval are things we've been trying to tune a lot over the years. On generation, you don't want it to remember something too literally. If you asked one time, oh, please open this as a draft PR, you don't want it to conclude that everyone, forever, should get their PRs as draft PRs. But you do want it to capture some kind of common behavior.

32:36 · Maybe you want it to say, oh, Cole generally likes things to be created as draft PRs. Same with retrieval. If you have thousands of these memories, how do you make sure they're retrieved at the right time? That can be quite tricky to do right without exploding the context with a bunch of useless information. It takes a surprising amount of eval work just to make sure that memory remains a reliable system as new models come and go.
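[Editor's sketch] The approve-or-reject flow and the scoping problem described above can be illustrated with a small Python example. This is not Devin's Knowledge system: the `MemoryStore` class, its method names, the scope strings, and the keyword-overlap scoring are all assumptions made for illustration. A real system would likely use a far better retrieval signal than word overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    text: str
    scope: str                 # hypothetical scope label, e.g. "user:cole" or "org"
    keywords: frozenset


@dataclass
class MemoryStore:
    approved: list = field(default_factory=list)
    pending: list = field(default_factory=list)

    def suggest(self, text, scope, keywords):
        """Agent proposes a memory; nothing is stored until a human reviews it."""
        memory = Memory(text, scope, frozenset(k.lower() for k in keywords))
        self.pending.append(memory)
        return memory

    def review(self, memory, approve):
        """Human quickly approves or rejects a pending suggestion."""
        self.pending.remove(memory)
        if approve:
            self.approved.append(memory)

    def retrieve(self, task, scopes, limit=3):
        """Return at most `limit` approved memories that match the task and scope."""
        words = set(task.lower().split())
        scored = []
        for memory in self.approved:
            if memory.scope not in scopes:
                continue  # a per-user preference must not leak to everyone
            overlap = len(memory.keywords & words)
            if overlap:
                scored.append((overlap, memory))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [memory for _, memory in scored[:limit]]
```

Scoping the draft-PR preference to one user, rather than the whole organization, is one way to avoid the "everyone forever gets draft PRs" failure, and the `limit` keeps retrieval from flooding the context.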

33:00 · Yeah.

33:01 · Do you have anything that you could share around memory pruning, and the temporal aspect of memory?

33:07 · Yeah, exactly.

33:08 · Yeah. Today, the thing it can do is edit memories.

33:14 · I see. And so if your memory used to say, oh, Cole likes to open everything as a draft PR, then you can imagine saying, no, don't do that. And then it'll say, oh, do you want me to update the memory to say that Cole now wants everything as open PRs? At the same time, we don't know if this is going to be the final version of the system. Whatever we have here will probably translate into the new system that we'll be coming up with. But I think one big difference between two years ago and today is that these agents are really good at using anything that resembles a file system natively. [laughter] And so part of us is thinking, should we rebuild memories to feel more like a file system that we let the agent navigate on its own? That's been an interesting exploration. There are also some ideas in the skills space.

34:05 · I'm pulling up OpenClaw's memory thing right now. OpenClaw has this daily memory journal thing, right? I mean, that is a file system you can kind of grep through, and it is a source of truth. I don't know if it's the best. It's probably super noisy, but at least if you lose something, you can rediscover it, or you can apply some kind of forgetting algorithm to more ancient memories that don't get recalled again, or something.
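[Editor's sketch] One simple reading of a "forgetting algorithm" for journal-style memories is to drop entries that have gone unrecalled for too long. The Python below is a toy under that assumption; `JournalEntry`, `recall`, `forget`, and the 30-day idle window are hypothetical names and values, not anything from OpenClaw or Devin.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JournalEntry:
    text: str
    created_day: int                       # day number the entry was written
    last_recalled_day: Optional[int] = None


def recall(entry, today):
    """Mark an entry as used so that it survives the next pruning pass."""
    entry.last_recalled_day = today


def forget(entries, today, max_idle_days=30):
    """Keep entries that were written or recalled within the idle window."""
    kept = []
    for entry in entries:
        if entry.last_recalled_day is not None:
            last_used = entry.last_recalled_day
        else:
            last_used = entry.created_day
        if today - last_used <= max_idle_days:
            kept.append(entry)
    return kept
```

Because the journal files themselves remain the source of truth, a pass like this could prune only the index the agent searches, leaving the raw entries on disk for rediscovery.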

34:31 · One thing we've been trying to do to push the boundaries of how you use agents at your company is letting an agent have a very similar file, like a memory.md or something, and just kind of be your permanent PM for a specific set of issues. So we have some Slack channels internally, maybe a Slack channel dedicated to a specific product like DeepWiki, and you can imagine that you want a Devin that never stops. It's just always awake, but it has this memory doc that it can maintain for itself about, okay, what are the number one priorities of what we have to fix, and who is responsible for some upcoming work. Maybe Devin will even tag you on some recurring basis. And so it's been an interesting move to see how we can use Devin for more than just engineering. Can we actually move upstream of the engineering process? And maybe it's just Devin creating tickets, which then maybe some humans do, but then maybe other Devins do.

35:30 · Yeah, one of my more fun automations is go research competitors and just suggest stuff to me on a weekly basis.

35:36 · [laughter] That's the automation, and I can't find it right now. But basically, it just looks at competitors and suggests things, plus "here are three things that you've suggested that I don't want any more of," and you just kind of stick that in a prompt. [laughter] But I wish, for example, that when I reject the PR it updated memory, so that I don't have to go back and update the scheduled sync. But, feature request. [laughter] You know, we might change it soon. I guess for Open Inspect, in the time you've been around, has there been anything you tried to implement which you then had to undo and do a different way?

36:11 · Nothing yet, but there's something that is on my mind. The initial way that I built it was that each of the integrations lives as its own package, and so you have the Slack bot, which handles the webhooks and then basically interacts with the control plane. As I'm seeing the system become more integrated, specifically with the GitHub bot integration, I'm considering bringing that all into the central control plane, because a request that I'm getting now is the ability to monitor the actual pull requests being merged, as well as tracking what I have open.

36:51 · Yeah. What do I have open? How many of these are getting merged? how many comments are showing up to just kind of understand the health of the system.

36:58 · And so in the case of a GitHub app, you only have one webhook. So then it's a question of, do I put that webhook in that GitHub bot package? That's kind of weird. It doesn't really make sense for it to live there, because that package is more for the code reviewer. Or do I centralize it? So that's a decision that's on my mind. I think the other one, which we touched on earlier, is the harness in the box versus out of the box. I think long term the architecture will eventually come back out of the box. Some of the newer tools that I've added call back into the control plane so that you don't have the secrets in the sandbox. And so long term I probably will pull the actual agent out of the box, but I think for now it's fine. Just a quick question on pulling the agent out of the box. One thing I'm very bullish on this year is agents calling other agents, or spawning sub-agents, or whatever you want to call it.

Sub-Agents, Multi-Agent Orchestration, and Meta-Devin

37:56 · Does that make it harder or easier? I can't tell. Because if the harness is in the box, you can spin up more boxes.

38:02 · Yes.

38:02 · If the harness is outside the box, then it's less easy, because you have, you know, a unicorn pet of a harness that's living outside the box. I mean, in theory it would be the same way, right? Whether one agent has launched many sub-sessions within it... Open Inspect, for example, can launch sub-sessions and actually create other environments and then monitor them. In the case where it is out of the box, that would basically just be an additional session that's running. And so that session is also running outside of the box. It's running in your worker plane, wherever you're running this. And then you really just have to think about how your top-level agent interacts with it. I do think it can be more complex, just because you now have a more difficult architecture, but I think once you've figured it out, it's probably fine.

38:57 · Yeah. Walden, I'm just throwing it open to you. I call this kind of thing meta-Devin management.

39:04 · Yeah.

39:04 · Which is like Devin calling Devins, or Devin scheduling Devins, or querying trajectories, or anything like that. What have you built or shipped, anything? I think one of the surprising things we've seen is that the way these separate agents work with each other, when you want them to parallelize their work, has still mostly followed the same manager and sub-agents regime. A lot of people, I think, are excited about this world where you have swarms of agents that talk with each other all over the place. We've actually given Devin an MCP so they can arbitrarily message other Devins and create [snorts] new Devins, etc. But it somehow creates a really chaotic world in that sense, and so we've still found that most practical day-to-day use has been one single agent figuring out how to segregate the work and having other Devins work on it in a relatively isolated way, each with their own boxes, not sharing machines. So there's very little room for conflict. That's kind of the regime that you have to create today.
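[Editor's sketch] The manager and sub-agents regime described above, with each worker in its own isolated box, might look like the following in miniature. Everything here is an assumption for illustration: `run_sub_agent` stands in for a real agent session, and a temporary directory stands in for a separate machine.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_sub_agent(task):
    """Hypothetical worker: does its task inside a private scratch directory."""
    with tempfile.TemporaryDirectory(prefix="agent-box-") as box:
        result_file = Path(box) / "result.txt"
        result_file.write_text(f"done: {task}")
        # Only a short summary goes back to the manager, never the workspace.
        return {"task": task, "box": box, "summary": result_file.read_text()}


def manager(tasks):
    """Single manager: splits work, runs sub-agents in parallel, collects summaries."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_sub_agent, tasks))
```

The sub-agents never message each other; all coordination flows through the manager, which is what keeps the room for conflict small.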

40:20 · I'll call out the experiments from Cursor, right? This is Wilson Lin's work on going from single agent to multi-agent, and you're obviously famously on the side of don't build multi-agents, but they went through the whole thing only to arrive at this [laughter], which is exactly what Devin has. Do you think... I think there will be a revision to that post at some point. I think multi-agents were very much not possible at all a year ago. You do see more multi-agent experiments today, but you can kind of argue: are they really multi-agents, or are they just kind of tool calls? There are people who will create sub-agents to go look for XYZ file or XYZ implementation. That has really nice context management benefits, because all of the tool calls and tokens that the sub-agent spends then get collapsed back to just the answer for the main agent. There are a lot of benefits to doing this. We basically have Devin do this with DeepWiki: make a call out to DeepWiki, and it gives you back the results. But that feels like a tool call. It's not like two collaborators actually talking back and forth with each other. But I think the thing that gives me the most bullishness that multi-agents might actually be possible is what I said earlier, that Devin will sometimes tell me I'm wrong and push back. I think that demonstrates a level of maturity and communication today that makes a multi-agent world possible. When can two agents who have seen different information come back to each other and actually figure out who is right, what is the correct implementation? They're not just yes-men. Claude, I guess, used to just say, what is it, you're right, or... You're absolutely right. You're absolutely right, yeah. Did you see the Codex app troll Anthropic? This is the Codex app. Inside of settings there's a little Easter egg, right? If you go to the themes, or appearance, there are all these color codes, and the top one is "Absolutely," and it's in Anthropic's colors [laughter], which is such a troll. Anyway, I love that Easter egg. Did you discover that yourself?

42:24 · No, it's like someone someone was uh tweeting about it and I was like I was like, is this true? Cuz like sometimes people just tweet stuff to uh get a rise out of you. But um yeah, there you go.

42:35 · And colors.

42:35 · Yeah.

42:35 · [laughter] But yeah, we're out of this regime where it just says you're absolutely right, and they can have real conversations and real back-and-forths.

42:43 · Yeah, you can prompt it as well to be more adversarial or whatever. Yeah. Okay. Yeah. To me, that is more intelligence, right? Like that that is not just something that's like a dumb tool. It's actually pushing back on you.

42:54 · Yeah. You mentioned the Cursor blog posts. There was one blog post they had where they put a swarm of agents together and built a browser.

43:04 · Yeah.

43:04 · Yeah, I think that was the one. I think it's the same one. Yeah, we found surprising success with not doing a swarm or anything. Just have one Devin, you know, it does its own context compaction. Just let it keep running for a while and give it some crazy tasks. I think we asked it to rebuild something like a Windows OS.

43:22 · Yes.

43:22 · And it managed to do it just like you know going on for long enough.

43:26 · Was this Andrew's thing?

43:27 · Yeah.

43:27 · Yeah. Okay. There were lots of demos that we ended up not posting, because at some point we'd just be posting way too much, a bunch of demos. But I love that, because it kind of shows that the multi-agent thing still has a bit of exciting sexiness to it, which is maybe still beyond the actual delta it adds to the capabilities of these systems. But it's absolutely the future.

43:51 · I think you know we're heading that direction and and we can see the progress being made there already.

43:56 · Yeah.

43:56 · If I were to make one super minor pushback, because I don't feel that confident about it yet: I've had Ryan Lopopolo from OpenAI on the pod, and he's a super slop cannon, right? Oh my god, that's my coding agent being done. [laughter] I downloaded this thing, Peon Ping. I don't know if you guys have heard of this.

44:15 · It takes sound packs from popular games like Command & Conquer and Warcraft, and then it plays one whenever the agent is done. So it's like "work, work," or "at your command," or something. Anyway, what I got from the Cursor codebase and from Ryan's thing was that there's a slop cannon approach where you try to loosen the single agent's bottleneck, and I feel like that is probably a very important thing to try to figure out. I don't think anyone's really solved it, because then you just have more reviewer slop on top of the agent slop to try to wrangle it all. Ryan will probably strongly object to my saying that he hasn't solved it. He thinks he's completely solved it. But I think it's very important, because that is a bottleneck, right? I feel Devin is slow sometimes, because I'm like, well, yeah, this is very readable and very sensible, but it is also slower than it could be. I want a button to just say, ramp this up 1,000x in parallel and see what happens, you know? And I don't know if that's feasible at some point in the future.

Vibe Coding, Auto-Merge, and Codebase Decay

45:25 · Yeah.

45:25 · And we've also run experiments internally where we've basically tried to build entire products, true products that we knew we would eventually ship, but for now, let's see if we can do it purely by vibe coding on top of each other: auto-merge, no code review at all. And then there's this kind of benchmark of how many weeks you can go on like this before you say, we have to trash this codebase and actually rewrite it from scratch. Yeah, yeah, what did you find? I think we found that the state of the art in December was that you could probably run this for about two weeks. By the end of those two weeks, you'd find that, hey, you want to change the color of a button. Well, it turns out this button is implemented in like 10 different places, in all these different variations, and oh, you forgot one of them, and it's actually a slightly different color in one spot.

46:12 · Okay, this is too much to work with.

46:14 · Let's actually try to do code review at the same time and make sure that we're on top of our stuff. We're actually cleaning it up a bit and making sure it's done in a scalable way. Yeah, I think, building on that, the idea that you don't have to look at code is generally a bad idea. Do you think that statement will stay true? I think probably for a while it'll be true that you should continue to look at your code. A problem that I see a lot of teams I work with run into, teams who are embracing AI-native, AI-first coding, and the meme that I have, is that your codebase regresses to your worst engineer. That engineer who is, you know, very gung-ho about AI and is not auditing their code: their patterns start cementing into the code, and now the AI is referencing their patterns. And so now their if-else block that is 20 if-elses back and forth, the AI is seeing that as the pattern for how things are done, and it starts to exponentially grow this slop. And I find, to your point, that a pretty good approach is having scheduled cleanup, whether by humans or through systems that look for duplication and then address it. You know, you'll end up with like 12 helpers for how to format a date, and you need to address that, because otherwise it will continue to sprawl. Within bounds, I think it's fine to have some duplication, and then sometimes you have garbage collection, right? Yeah, what I've been talking about with a lot of engineering leaders is that you want to be very strict about the boundaries between modules, and it's your job as an architect, as a CTO, whatever, to say, okay, here's the hard contract between you guys and you guys. Whatever you do inside this black box is your business. You do whatever, but between these guys, let's be really damn clear, and any movement must be signed off by a human, or by me, you know, and that's that. I don't know if you have any other modifications or advice.

48:14 · Well, I guess generally on the topic of where humans can be useful: I've found that with some of these really deep infra problems, sometimes just having a human who has really deep expertise can make a big difference. I've actually seen this come into play when building agents.

48:32 · We've had a few friends now try building their own coding agents, and one problem I've heard a lot of them run into again and again is that, oh, you know, grep is really slow on our agents' machines. And so a lot of them, I assume because they're using AI and don't themselves have super deep infra background knowledge, say, okay, we're going to go build our own custom grep index, it's going to be really fast, and use that as a way around this problem. When we ran into this problem, maybe a year and a half ago, in the early days of building Devin, we obviously didn't have AI that we could just ask how to do this, or that would just whip up a new grep index. So what do you mean, you hand-coded Devin?

49:17 · What? [laughter] Yeah, can you believe we hand-wrote this code? And we had our infra people, who are really amazing.

49:27 · They were looking into it and they said, oh, you know what? We realized that the root cause of this problem is actually super simple, but it's a fine-grained detail, which is that a lot of these virtual machines don't use real file systems underneath. They use network file systems, where things are actually cached over the network, in S3. So when you're grepping, you're actually making network calls.

49:47 · Yeah. Every time you're doing these things, and that's why grep is extremely slow on these machines. And so again, it goes back to all of the crazy infra work that we had to do to actually get these machines working. If you try to do this yourself, you know, there are tons of small details like this, and so we eventually had to swap out that network file system.

50:05 · Yeah

50:05 · I think there's a write-up about it, right? So I listed one about the virtual... that was a whole other thing. The thing is the blockdiff file storage format, which is a file system format that we built so that the VMs could be spun up and down very quickly. Basically, the intuition behind it is: imagine you have a terabyte of disk, and your agent only wrote like 100 lines of code on top of that disk. How long does it take to save that disk and bring it back up? In most systems, because they're not optimizing for this case, it's on the order of a terabyte of work, because you have to save all of that and bring it back up. In our system, we tried to build a file system where snapshots build incrementally on top of each other. So every time you save and bring the machine back up, you're only doing work that's effectively proportional to the diff in the file system. Yeah.
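[Editor's sketch] The intuition of saving only what changed can be shown with a toy block-level diff in Python. This is not the format described above: the 4 KiB block size and the `snapshot_diff` and `restore` helpers are assumptions, and this version still reads and hashes the whole disk to find changes, whereas a real system would track dirty blocks so that the work itself stays proportional to the diff.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the illustration


def block_hashes(disk):
    """Hash every fixed-size block of a disk image held in memory."""
    return [
        hashlib.sha256(disk[offset:offset + BLOCK_SIZE]).hexdigest()
        for offset in range(0, len(disk), BLOCK_SIZE)
    ]


def snapshot_diff(base, current):
    """Return only the blocks of `current` that differ from `base`."""
    base_hashes = block_hashes(base)
    diff = {}
    for index, offset in enumerate(range(0, len(current), BLOCK_SIZE)):
        block = current[offset:offset + BLOCK_SIZE]
        is_new = index >= len(base_hashes)
        if is_new or hashlib.sha256(block).hexdigest() != base_hashes[index]:
            diff[index] = block
    return diff


def restore(base, diff, length):
    """Rebuild a disk image from the base image plus the saved blocks."""
    disk = bytearray(base[:length].ljust(length, b"\0"))
    for index, block in diff.items():
        start = index * BLOCK_SIZE
        disk[start:start + len(block)] = block
    return bytes(disk)
```

With a one-byte change on a four-block disk, the saved snapshot holds a single block instead of the full image, which is the property that makes saving and restoring cheap when the agent has written very little.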

Agent Infra, VPCs, Cloud Providers, and Fast VM Restore

50:54 · And so this, you know, shaves off a lot of time in the boot-up process of Devin. I think this is actually outdated now. We have a newer system inside of Devin. But yeah, there are a lot of tiny details you have to get right here to actually make the day-to-day experience of Devin good.

51:09 · It's not technically agents, but it is agent infra. And when you sell an agent as a company, you sell agent plus agent infra. Yeah. At least the way we do it. And the nice thing about having the agent and the agent infra done together is that we kind of get to deploy Devin in whatever environment we want. We don't need to wait for some underlying infra provider to also go and support VPC or on-prem or fed GovCloud, for instance. So we can actually go and figure out, okay, since we own the infrastructure, how can we get that set up for you?

51:42 · Yeah.

51:42 · Where's your... you're Cloudflare dependent? So Cloudflare runs the control plane. For the sandbox, Modal is supported. A contributor just added Daytona. E2B is on the roadmap, and I think there's an abstraction in place so that if any contributor wants to add a new provider, they can add that in.

52:00 · Yeah. Yeah.

52:02 · Amazing.

52:02 · Well, how about the customers you work with? Do they generally try to set up a contract with another one of these third-party providers? Do they try to do the VMs in house? Most of them I see using Modal. Shout out, Modal. Shout out [laughter] Modal. I think Modal has a great offering. It kind of captures all of the sandbox pieces you need, snapshots being a pretty big piece of that. And given that they also offer GPUs, I think it's a pretty nice offering as a whole.

52:34 · Yeah.

52:34 · No debate there. Modal is great. I think their container offering especially is the most natural, and so if you are willing to forgo the full VM requirements, Modal is a really fast place to spin something up on.

52:50 · Yeah.

52:50 · Is there a point? So Modal's very Python, and [snorts] I feel like most workloads have really shifted to JavaScript. I don't know if you guys get the same feeling. When I started Latent Space and AIE and all these things, I was like, it's 50/50 Python and JS, right? I think that's wrong now. I think JS has won. Maybe I'm overstating it, and maybe for Cognition, you know, there's C# and Java and what have you. But for new greenfield apps, do you get that sense? Does it matter? I think that most of the libraries that I see in the space are Python-native first, especially in the observability space.

53:29 · That said, I think there is a pretty big appeal to having your entire system in one language, especially when you have both your front end and back end communicating. You can have one central type, which is very nice.

53:41 · Yeah

53:41 · That's my case against Modal, because then you have to run... I mean, you can run JS inside Modal, it's just one extra step that isn't native to the runtime. I don't know if... Yeah, I don't use... Do you have numbers? I don't know. The one thing I don't like about Python is that whenever AI writes Python, it always does the weirdest patterns. Is that because it's mixing two and three, or what?

AI Code Smells, Reward Hacking, and Code Review Systems

54:04 · Yeah, I think it's something like mixing two and three. I don't know if you see this: it always tries to do hasattr on objects. It's like, yeah, but you shouldn't be doing that.

54:14 · Like, it should error. Is it because it's training on library code? From what I've seen, it's more of a reward hacking mechanism, where it doesn't want the code to fail, and so even when it knows the object has the attribute, it'll call getattr on it. For a lot of my clients who have moved towards more autonomous coding, we've put that in as a lint rule: if you use getattr, your pull request is going to fail. Oh, this is a fun topic.

54:43 · Can you tell me more like this? What else is a sign of AI coding that you have to put guards in for?
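[Editor's sketch] A lint rule of the kind described above can be approximated with Python's standard `ast` module. The speaker does not say which linter his clients use, so this is only an assumed stand-in: `find_banned_calls` and the `BANNED_CALLS` set are hypothetical, and the check only sees direct calls by name, not aliases such as `g = getattr`.

```python
import ast

BANNED_CALLS = {"getattr", "hasattr"}  # assumed policy for the illustration


def find_banned_calls(source):
    """Return (line, name) for each direct call to a banned builtin."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BANNED_CALLS
        ):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)
```

In CI, a non-empty result would fail the pull request, which pushes the agent back toward plain attribute access that errors loudly when the attribute is missing.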

54:51 · So we were talking just before this about Opus 4.7. One of the things this new model likes to do is write lots of comments. Not that it'll comment on every line, but it'll write paragraphs, like PRDs, on top of every function. But I will say, to its credit, these aren't slop descriptions like they were before, like, oh, here's what this function does. It's like, oh, here's actually the reasoning, why we chose this approach, what the alternatives were, and why we shouldn't do those alternatives.

55:22 · Still too much information. But I wonder if this might actually be directionally correct if you want systems that can maintain themselves in the long run. Oh, they write their specs inline, as context in the code as well. Yeah.

55:37 · So, you approve? I... but at the same time it's this tricky problem. Maybe we'll just give our users a setting or something for how verbose you want it to be. I haven't loved it. I like the comments, but please, get rid of them.

55:53 · Yeah. Yeah. Yeah.

55:54 · But I could I could see a world where maybe something of the sort becomes reality. I don't know if you guys know about Git AI.

56:01 · Yes.

56:01 · Yeah.

56:02 · We've talked about it. Yeah. The idea behind Git AI is that if you run an agent, the actual prompts you send to the agent should be stored alongside the code inside the git metadata, so that [clears throat] future agents can reference it, and maybe code review bots can reference it. It's kind of an ideal world where the context for why decisions were made constantly lives right beside your code. And so it's maybe a more hidden version of this write-massive-PRDs-in-every-comment sort of approach.

56:31 · Yeah, I'm waiting for the real bull case where we just get rid of git altogether. I'm not there yet, but I'm looking for it, because that would be a big shift.

56:41 · Kind of on the topic of visible slop, a pattern that I see a lot across GPT models specifically is backwards compatibility at all costs, where it's doing these weird imports and re-exports so that it doesn't have to modify the names of where the modules were. And I've seen Claude 4.6 starting to do this as well.

57:02 · Oh no.

57:03 · And again, I think it's this reward hacking behavior where it doesn't want failure to occur. You can address that through Semgrep or other tools, where that behavior is pretty easy to identify. But it's something that you kind of only learn through the trade, just from seeing code patterns. Yeah. Untyped tuples are a really big problem, and again, just throwing Any in there, like dict[str, Any]. And again, you can address those through linting.

57:31 · Awesome. Yeah. Any others? So, linting. Any other tools? Devin Review, of course. Not so free now, but you know, still use it.
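[Editor's sketch] The untyped tuple and dict[str, Any] patterns can also be caught with a small annotation check. As before, this is an illustration rather than the speaker's actual rule: `find_loose_annotations` is a hypothetical helper, it needs Python 3.9+ for `ast.unparse`, and it only recognizes the bare names `Any`, `tuple`, and `Tuple`, not spellings like `typing.Any`.

```python
import ast


def find_loose_annotations(source):
    """Return (line, annotation) for annotations using Any or a bare tuple."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        annotations = []
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            annotations.append(node.returns)
            args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            annotations.extend(arg.annotation for arg in args)
        elif isinstance(node, ast.AnnAssign):
            annotations.append(node.annotation)
        for annotation in annotations:
            if annotation is None:
                continue
            names = {n.id for n in ast.walk(annotation) if isinstance(n, ast.Name)}
            bare_tuple = (
                isinstance(annotation, ast.Name)
                and annotation.id in {"tuple", "Tuple"}
            )
            if "Any" in names or bare_tuple:
                findings.append((annotation.lineno, ast.unparse(annotation)))
    return sorted(findings)
```

A team could decide which of these findings fail the build and which only warn, since some uses of `Any` at system boundaries are legitimate.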

Making Codebases Agent-Ready

57:40 · So one thing that I think we try to recommend to teams as they use more AI agents goes back to this local testing thing. At the end of the day, you want your agent to be able to do the full thing: not just write the code, but actually run it and test it. And a lot of codebases were not necessarily built for this from the start. For example, you probably do want a local DB setup, a local Docker Compose and Postgres, so that you don't need to give your agent any crazy prod credentials to actually run and test its code. We've also internally done a big shift to make a lot of our core code components testable purely in local dev, without needing to integrate with any live services, for this reason. And obviously, the older the company, the more you have to change to shift in this direction, but you know, you can use AI to help you perform this migration.

58:32 · The older the company, the more you have to change in order to do local dev? Am I misunderstanding? So you're saying most people just build with full integration to other stuff, and there's no code path to switch it to local. Especially when there are lots of different services and you have a microservice architecture, the larger the codebase, the harder it is to make that shift. I guess if you did build it correctly from the very start, it's possible, but there are also a lot of companies in the world that got started before Docker was a thing, and so yeah, you're kind of forced to make a migration at some point. Well, Devin's very good at making mock servers.

59:07 · Yes.

59:07 · Right. So, uh, and know what one of the projects I really wanted, it's like like little snitch. I don't know if you guys have heard of I run little snitch on my computer. There's like a man in the middle, but it it like shows you all the traffic going back and forth.

59:20 · But then from there, you can sort of reconstruct the server, right? And then create local mocks, so you can locally mock everything if you just observe traffic for a little bit.
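The "observe traffic, then mock it" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recorded-exchange format (`method`, `path`, `status`, `body`), the sample data, and the function names are hypothetical, and a real recorder would also need to handle headers, query strings, and request bodies.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample of observed traffic; a real capture would come from a proxy.
RECORDED = [
    {"method": "GET", "path": "/api/users/1", "status": 200, "body": {"id": 1, "name": "Ada"}},
    {"method": "POST", "path": "/api/login", "status": 401, "body": {"error": "bad credentials"}},
]


def build_mock_table(recorded):
    """Index recorded exchanges by (method, path); the last observation wins."""
    table = {}
    for exchange in recorded:
        key = (exchange["method"].upper(), exchange["path"])
        table[key] = (exchange["status"], exchange["body"])
    return table


def lookup(table, method, path):
    """Return the recorded (status, body), or a 404 for requests never observed."""
    return table.get((method.upper(), path), (404, {"error": "not recorded"}))


def make_handler(table):
    """Build a request handler class that replays the recorded responses."""
    class MockHandler(BaseHTTPRequestHandler):
        def _reply(self):
            status, body = lookup(table, self.command, self.path)
            payload = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        do_GET = _reply
        do_POST = _reply

        def log_message(self, *args):
            pass  # keep test output quiet

    return MockHandler


def serve(recorded, port=8099):
    """Serve the mock on localhost until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), make_handler(build_mock_table(recorded))).serve_forever()
```

Calling `serve(RECORDED)` would start the mock locally; the service under test can then be pointed at `http://127.0.0.1:8099` instead of the live dependency.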

59:28 · Yeah, that's an interesting idea.

59:30 · [laughter] Um, cool. I don't know if this will get anywhere, but I wanted to maybe talk a little bit about the Claude Code leak, because usually if I have an Anthropic person on, I can't talk about the Claude Code leak. [laughter] Did you guys learn anything from Claude Code?

59:48 · Our team was not that interested in that leak. We didn't spend that much time on it. But I'm just fishing for... No, I didn't really research too much into it.

Windsurf 2.0 and the Local-to-Cloud Agent Handoff

1:00:01 · Fair enough. Okay, one last thing before we go. Windsurf 2.0, you guys shipped another thing. So the sort of meta context is, if you use background agents enough, sometimes you're going to want to bring them to the foreground, and that little handoff from local to cloud is hard to work on. And then Devin, or Cognition, has just done it.

1:00:20 · Yeah, I think for me the biggest gap this is trying to close is, again, how do you make the testing process as fast as possible? When it can test on its own and send you a video, it's freaking magical. Sometimes there are just really difficult things that you do just need to pull down locally to test. And we just want Windsurf to kind of be your local command center for all your agents, your background ones, your local ones. And you can imagine: oh, okay, this agent needs me to review something. I'll pull that down, move my other agents to the background, go test it. Okay, boom, done. On to the next one. Right? You have some issue you've got to fix in the background, just click approve. Okay, start a background agent to go fix it. I'd love a world where I'd never have to leave this window. Then maybe the other window, I've got to figure out how to stop spending so much time in Slack, but maybe, you know, someday we'll want to get those two as well.

1:01:08 · Yeah. And like does that require the binaries to be exactly the same for local versus cloud?

1:01:16 · So the funny thing here is that the behavior of local agents and cloud agents, I think, is actually a bit different in their ideal state. I think with local agents, you want them to be a bit faster and let the user make the call on things. Actually, don't try to autonomously go test things. In the background agent mode, where you go start it off, I think the agent should just assume: the next message I send the user should have everything that the user needs from me. And, you know, don't run and stop. Keep running and don't stop until you have the testing, until you have...

1:01:49 · So that's that's just a slightly different prompt.

1:01:51 · Yes.

1:01:51 · But for many reasons, because of all the work we do to make sure that Devin works with different Git providers, that it works with different OSes and VMs, we want as much of that logic to be shared as possible. So for our own practical purposes, we try to share as much of it as possible.
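One way to picture "shared logic, slightly different prompt" is a common instruction block plus a small per-mode addendum. The Python sketch below is an assumption-laden illustration: the prompt wording, the mode names, and `build_system_prompt` are invented for this example and do not reflect how Devin or Windsurf are actually configured.

```python
# Hypothetical shared instructions used by both local and cloud agents.
SHARED_INSTRUCTIONS = (
    "You are a coding agent. Follow the repository's conventions "
    "and keep changes minimal."
)

# Hypothetical per-mode addenda paraphrasing the behavior described above.
MODE_ADDENDA = {
    "local": (
        "Be fast. Let the user make the call on things and "
        "do not autonomously run long test cycles."
    ),
    "cloud": (
        "Assume your next message must contain everything the user needs. "
        "Keep running and do not stop until the change has been tested."
    ),
}


def build_system_prompt(mode):
    """Combine the shared instructions with the addendum for the given mode."""
    if mode not in MODE_ADDENDA:
        raise ValueError(f"unknown agent mode: {mode!r}")
    return SHARED_INSTRUCTIONS + "\n\n" + MODE_ADDENDA[mode]
```

Everything outside the addendum (Git provider handling, OS and VM support, tooling) stays common to both modes in this arrangement.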

1:02:06 · Yeah. Yeah. I mean, I can't imagine how much work it is to, you know, transition back and forth. So congrats on shipping this.

1:02:14 · Thank you.

1:02:15 · Okay. Uh anything else that we should cover before we uh wrap? um just whatever you guys were talking about in your lunch.

1:02:21 · Um [laughter] maybe use cases. What do you find to be the biggest things that your clients are trying to do with their cloud agents today?

1:02:29 · Do you want to just ask it again so we can get like a clean cut?

1:02:32 · Yeah. He was drinking his water. Yeah.

1:02:34 · Yeah.

1:02:34 · The thing I wanted to talk about was use cases. What do you think are the main things that your clients come to you about today? "Hey, this is why we want to go set up cloud agents." Yeah, I think the easiest and most common use case I see across everyone is SRE use cases. The idea is that whether we have our alerts in Slack or Datadog or wherever they're going, we want the agent to be the first responder on that.

SRE Auto-Triage, PMs Shipping Code, and Agent Use Cases

1:03:00 · And that doesn't necessarily mean that the agent is actually resolving the issue, but just being able to collect that context ahead of time is huge, because again that agent is integrated into the production logs and the database, it has full visibility, and over time it has playbooks as well for how to address certain issues. And so that's a huge win for teams, because instantly you can have a full trajectory of what is going on within the system, and oftentimes actually a pull request directly from that, which is a pretty neat flow to experience: error, pull request, done. OpenInspect does support a trigger for that as well. So that could happen completely autonomously, from Datadog specifically or just...? It supports Sentry. It supports a generic webhook, and if someone wants to add Datadog, they can.
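To make the "agent as first responder" flow concrete, here is a small Python sketch that turns a generic webhook payload into a triage task. The payload fields (`source`, `title`, `service`, `message`), the `TriageTask` type, and the prompt wording are all hypothetical; this is not the OpenInspect, Sentry, or Datadog API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TriageTask:
    source: str
    title: str
    prompt: str
    context_to_collect: List[str]


def task_from_alert(payload: dict) -> TriageTask:
    """Build a triage task from an alert payload (field names are assumptions)."""
    source = payload.get("source", "generic-webhook")
    title = payload.get("title", "untitled alert")
    service = payload.get("service", "unknown-service")
    message = payload.get("message", "")
    prompt = (
        f"Alert from {source} on {service}: {title}\n"
        f"{message}\n"
        "Collect context first. Only open a pull request if the root cause is clear."
    )
    return TriageTask(
        source=source,
        title=title,
        prompt=prompt,
        context_to_collect=[
            f"recent logs for {service}",
            f"recent deploys of {service}",
            "matching playbook, if any",
        ],
    )
```

The point of the sketch is the ordering: context collection is always requested, while a fix is optional, which matches the description of the agent gathering context even when it does not resolve the issue.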

1:03:46 · Yeah.

1:03:46 · The other use cases that I see are kind of non-builder use cases, whether that's the PM or the marketing team. I'm seeing a lot of teams where the idea of who's actually contributing code is starting to change. And in a lot of cases, if there's just a quick bug fix, the PM is not creating an issue anymore. The PM is just prompting through Slack and the pull request is then being created. And so I think that's a huge win. I think that trend will continue, where we're seeing code modifications happening outside of engineering. The last common use case that I see is customer support, where they're experiencing an issue with a customer and they're not entirely sure why this behavior is happening.

1:04:32 · Previously that world was, hey, there's a bug when they tried to use this feature. We don't know what's going on. Well, they're now tagging that in Slack.

1:04:40 · Again, that entire full context is ready. They can then just tag in engineering and have a complete understanding of that issue, and completely bypass the previous pain points of, like, "Oh, can you get more information from them?"

The only things I'd add on top of that: continual security scanning, continual security review, is a very big one as well. The SRE use case, internally we think about it as auto-triage, because we just want every message that comes in, whether that's an alert or a bug report, to have Devin start triaging before anything else. And we've leaned into this use case so much that we've basically tried to make it so that you don't ever have to leave Slack to interact with this. So again, making the interactions with Devin super fluid, from the moment the report comes in, to it responding to the report, to being able to ask it questions right there with full codebase context about all the issues. Very related to customer support as well. I think one thing that we found is CLIs can sometimes be very difficult for people who aren't technical to go and use, but an online chat interface where anyone can go and ask questions, that's super intuitive and doesn't assume you have any technical knowledge but does have access to all parts of your codebase, is super useful for support, for salespeople, for anyone who might need to have their questions answered about the codebase.

Yeah, great callout. This might potentially be a very expensive use case. Is there a rule of thumb on how much people should spend on this? Because you have unlimited budget, but other people don't. I don't know if this is an answerable question, because obviously it depends on a lot of factors.

I think it depends really on how people are using it. I think if people are using it responsibly and they're getting value from it, then you can kind of determine the budget. Common numbers that I hear are anywhere from 1,000 an engineer up to 5,000 an engineer.

Agent Budgets, Hybrid Models, and Autonomous Coding Factories

1:06:37 · Yeah.

1:06:37 · I have not heard anywhere in the realm of like 50,000 an engineer for a frame of reference.

1:06:42 · We'll get there.

1:06:43 · Yeah. I've seen numbers go that high for sure. Yeah.

1:06:46 · I think that this is also going to be a big theme of the coming year: we're going to see very expensive, very smart frontier models, and we're also going to see people who say, you know what, I don't need the frontier anymore for a lot of the work I do, because subfrontier models actually are good enough for a lot of the work.

1:07:05 · Also, shout out, you pioneered Smart Friend, which is a mix. I'm really interested in a world where you basically have hybrid frontier and subfrontier systems, where you use the subfrontier part to be really fast, really efficient, and call out to the frontier part of the system so that you can still get frontier performance for the most part.
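A hybrid frontier and subfrontier setup can be sketched as a simple escalation rule: let the fast model answer, and call the frontier model only when needed. The Python sketch below assumes the fast model can report a confidence score, which is an assumption made for illustration; the stub models, the threshold, and the function names are all hypothetical and are not a description of Smart Friend itself.

```python
from typing import Callable, Tuple

# Hypothetical model interfaces: the fast model returns (draft, confidence),
# and the frontier model receives the task plus the fast model's draft.
FastModel = Callable[[str], Tuple[str, float]]
FrontierModel = Callable[[str, str], str]


def hybrid_answer(task: str, fast_model: FastModel, frontier_model: FrontierModel,
                  threshold: float = 0.8) -> Tuple[str, str]:
    """Answer with the fast model, escalating to the frontier model when unsure."""
    draft, confidence = fast_model(task)
    if confidence >= threshold:
        return draft, "subfrontier"
    return frontier_model(task, draft), "frontier"


def stub_fast(task: str) -> Tuple[str, float]:
    """Stand-in fast model: pretends to be confident only on short tasks."""
    confidence = 0.9 if len(task) < 40 else 0.3
    return f"draft: {task}", confidence


def stub_frontier(task: str, draft: str) -> str:
    """Stand-in frontier model: pretends to review the task and draft."""
    return f"reviewed: {task}"
```

For example, `hybrid_answer("rename a variable", stub_fast, stub_frontier)` stays on the fast path, while a long, open-ended task is escalated. In practice the escalation signal could be something other than a confidence score, such as the fast model explicitly asking for help.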

1:07:24 · Yeah, I'm trying to search, but Twitter search is completely broken. Like, the "from" field is just completely gone. It's very sad. No worries. I might have to make a new post at some point about the return of Smart Friend.

1:07:40 · Yeah. Yeah. I mean, Anthropic has now officially adopted it.

1:07:43 · Yes.

1:07:43 · Okay, cool. I think that's it. It's a really great discussion. Great having you guys on. Background agents are a thing now and everyone's building them. We talked a lot about the production concerns and why you would want to offer one architecture over the other. Yeah, lots to look forward to.

1:08:05 · Yeah, there's a real zeitgeist in the space right now, I think, for companies to want to turn themselves into these autonomous coding factories, and we're doing a lot to try to support that. So any listeners are welcome to come chat with us about that, whether using Devin or working with us.

Hiring at Cognition and OpenInspect Consulting

1:08:21 · Yeah. Hiring.

1:08:22 · Yes, of course.

1:08:23 · What specifically? Just give one profile that's very interesting. I think people underestimate the role of really high-taste product engineers in the space right now, and the test is: what have you shipped end to end that is a tasteful product? If you've shipped stuff that you think is tasteful and that you're proud of, you should come talk to us. Yeah, for me, any businesses that are looking to further their engineering org. A lot of the consulting I do is around that: teams who are maybe starting their AI journey, whether that's with Cursor or Claude Code, but they're looking for someone to help navigate them through the state of the art and beyond just that initial deployment. As mentioned, there's a lot of lift from "you've deployed the background agent" to "how do we actually get this fully integrated into the company and really realize the true value of that."

Outro

1:09:15 · Yeah. Okay. Well, thank you guys for coming on.

1:09:17 · Cool. Thanks for having us.

1:09:18 · Yeah. Thank you.

1:09:21 · [music]