---

Transcript

Early career

**** · Should I consider a multi-zone, multi-region, or even a multi-cloud setup? How much availability risk are you willing to take on versus the computational overheads, but also the human overheads of designing and operating the system? MapReduce is dead. Nobody uses it anymore. But other areas where we've increased the coverage are systems in support of AI, like vector indexes. Is there any risk as a software engineer that you're no longer incentivized to understand the underlying layer? If you rely on a higher-level abstraction, you're no longer thinking about the lower-level details. If you're building higher-level business logic, I think it's just fine. LLMs increase the need for these formal proofs because we're vibe coding a bunch of stuff. The reason I think that formal verification could become more important in the future, one is that...

**** · Designing Data-Intensive Applications has been the go-to book for anyone building large backend systems. Nine years after publishing this book, the second edition is here. Martin Kleppmann is the author of this generational book. I sat down with him, and today we cover how working on Kafka at LinkedIn directly shaped ideas that became the first edition of the book, what's new in the second edition, and why things like MapReduce got removed from this updated version. Formal methods, local-first software, decentralized access, and many more. If you care about how large systems work, where they're heading, and what the fundamentals are that don't change, this episode is for you.

**** · This episode is presented by Statsig, the unified platform for flags, analytics, experiments, and more. This episode is brought to you by Sonar. Sonar, the makers of SonarQube, understands that code quality is about more than just avoiding syntax errors. It's about long-term maintainability by protecting the structural integrity of the system.

**** · As agents generate code at massive scale, they often ignore your system's structural integrity. This creates tangles, duplicated code, and other maintainability issues. These issues turn a modular design into a big ball of mud, making it increasingly difficult to extend. But here's something that's really helpful: SonarQube's architecture management. It moves architectural governance out of static wikis and into your automated workflow. It allows you to visualize your current architecture, define architectural boundaries, and manage architectural issues in real time. Whether it's a human or an AI agent at the keyboard, Sonar acts as a circuit breaker for structural decay. It ensures every commit respects the system's blueprint, protecting the long-term health of your most complex applications. Head to sonarsource.com/pragmatic to find out more.

**** · So Martin, welcome to the podcast.

**** · Hi Gergely, it's great to be here.

**** · It's amazing to have you here. I don't think you need an introduction for many software engineers, including myself. You're the author of this iconic book that I've had on my bookshelf for probably about 10 years, since not long after it came out. Before we get into this book, which we're going to talk about, how did you get into the technology field?

**** · Yes.

**** · Well, I did an undergraduate degree in computer science, like many others. And then after that, I wasn't quite sure what to do with my life, but I thought, well, starting a startup seems like an interesting thing to try. So I started a startup, having no clue what I was going to do, and then spent the first while searching around for things that might be interesting. The first startup didn't work out that well, but through that I met some others who then became my co-founders for the second startup, which worked better, and we sold that one to LinkedIn. After that I started being interested in teaching these distributed systems concepts, so that's when I got into writing the book. And then during the writing of the book I also switched over from industry back to academia.

**** · Can we talk a little bit about your first and second startup?

**** · Yeah. Go Test It, this was 2008 or something like that. It was the age when people were having real difficulties getting their JavaScript working cross-browser. Internet Explorer was still pretty big at the time. Chrome had just come out. All the browsers were incompatible with each other, and so Go Test It was a cross-browser automated testing service for websites. It was based on Selenium, an open source project that still exists. The idea is you would write test scripts that automate a user clicking through the various interactions with a website and then just check that the behavior happens. So yeah, it was based on Selenium, but just provided as a hosted service so people wouldn't have to run various VMs with various operating systems themselves. It worked technically, but I found it really hard to get adoption for it. A lot of people building websites said in theory, "Oh yeah, this is great, we need to test cross-browser," and in practice it was really difficult to get them to integrate it into their workflow and just get in the habit of using it and investing in writing the test scripts. So that ended up not really going anywhere.

**** · So there wasn't a business to be done, or revenue to be generated in a meaningful sense?

**** · Yeah.

**** · Well, there's at least one other, maybe two other companies from that same era that did manage to make a business. Sauce Labs is one that managed to succeed. But even for them it was a pretty slow-running business. I think it was not an easy business to be in.

**** · And for the startup, were you in the UK building it?

**** · I was in the UK at the time.

**** · Was it bootstrapped? Did you raise some funding? How big was the team? How can we imagine this?

**** · It was mostly bootstrapped. So I did a bunch of consulting in order to fund hiring some people, and then hired some friends on the cheap to help contribute to building the product. And so it was all done very cheaply. I had a very small amount of angel money in there, but mostly bootstrapped.

**** · Mhm. And then when you decided to not go forward with this, how did the next startup, Rapportive, come about?

### Building Rapportive

**** · Yeah, the second one was Rapportive. That went a lot better. So, that was putting social media inside Gmail. The idea was that if you get an email from someone you don't know, we had a little browser extension which manipulated the Gmail web interface so that on the side, next to the email, we'd show you a summary social profile with a profile picture and a job title pulled from LinkedIn, and recent tweets pulled from Twitter, and maybe recent Facebook posts or things like that. Just whatever we could find about that person, and put that as a social summary next to the email. We started in 2010 or something like that. It then pretty quickly became quite popular, and so on the back of that we were then able to raise some money from Y Combinator, which was still fairly young at the time.

**** · That was very young. You must have been one of the very early batches.

**** · Yeah, I can't remember exactly when they started, but it was certainly in the early years. I think Y Combinator had already built up quite a good reputation at the time, but it was still fairly small.

**** · And then as part of Y Combinator, did you have to fly from the UK to San Francisco to attend that 10-week program, if I remember?

**** · Exactly.

**** · Yes. So we initially came for the 3 months, or whatever it was, of Y Combinator, but then we were able to get US work visas for ourselves and set up permanently in San Francisco.

**** · How was that shift from the UK, where you spent university, your first startup, the first part of this, to coming to San Francisco?

**** · It was very exciting because it felt like going to the center of where it was all happening, really. We started out not knowing anybody at all. We knew one or two people in the entire Bay Area, but we contacted them and they introduced us to more people, and they introduced us to more people. And so we were able to pretty quickly build up a network, and that's something that I really appreciated, that it was so open to outsiders like us who could just turn up with an idea and an early-stage startup. We managed to raise some money and managed to become somewhat established in the Bay Area.

**** · And can you tell me how the company grew, and at what point the LinkedIn acquisition offer came, and how we can imagine this, given you were a founder of this company?

**** · It was in about 2012 that we sold it, and we were five people at the time. So it was all still pretty small. Not vast amounts of money involved, but it was a success, I would say, for everybody involved. The acquisition process itself was fine. As always with these kinds of transactions, there were twists and turns and moments where we thought it would all fall apart. And then we were almost running out of money and hadn't really succeeded in raising another round, so we had to sell or shut down. So we were under quite a bit of pressure. We couldn't reduce our own salaries, because to do so would have violated the conditions of our visas. So we were in a slightly stuck situation, given our lack of leverage in that situation.

**** · And I'm pretty happy how it all turned out.

**** · Yeah, it's nice that after 10-plus years we can talk about this honestly, because oftentimes you see an acquisition by LinkedIn, and of course you might ask the founders and they would say this was either our dream or our goal, or we will do so many things together. But something that you don't often hear is that there was pressure involved as well. So, did you go into this wanting to sell the company because you saw that either you needed to raise a new round or you sell to someone, and then you found LinkedIn to be the best, or the only, option to go with?

**** · We tried a little bit to see what revenue-generating options we had and hadn't really managed to make that work. So we were just burning money, and our user growth was okay but not really enough to go and raise a big round. So we were a little bit stuck there, and selling the company seemed the least bad option in a way. And I'm pretty happy how it turned out, because LinkedIn was great. They were very good to us. They allowed us to operate as essentially an independent team within the company.

**** · So your team stayed together?

**** · Our team stayed together. We continued working on the product that we wanted to make.

**** · Oh, you got to keep working on Rapportive.

**** · Yes.

**** · Well, so Rapportive, the Gmail browser extension, got put on life support, but we were working on a new product at the time, which did eventually get released under the name LinkedIn Intro. It got a slightly weird reception at the time, and it ended up getting shut down shortly after we released it. There's a longer background story there, but I'm still really happy with how LinkedIn gave us the freedom to do this and allowed us to launch this product. Even though it didn't succeed, they were very good to us throughout that process. And then after that got shut down, our team got disbanded. But we had a good run within LinkedIn building this product.

**** · What tech stack did you work with at the time? What did you use?

**** · At Rapportive it was fairly unexciting. It was a Rails app with a Postgres database and some Redis and some similar things like that mixed in. So nothing particularly revolutionary. We essentially built a graph database on top of Postgres. So there was a little bit of technical interest in there, but nothing particularly outrageous.

**** · And then after LinkedIn Intro you still worked inside LinkedIn. As I understand, you worked on data infrastructure.

### Working at LinkedIn

**** · Yes

**** · Data infrastructure. After our team got disbanded, I switched over to the stream processing team. So Kafka had just been developed at LinkedIn and had just...

**** · Oh, it was just being open sourced.

**** · Yeah, I think it had just been open sourced, and then I got to work on Samza, which was a stream processing framework on top of Kafka.

**** · I always wanted to ask this question, so here it comes. Why did LinkedIn build Kafka, or develop Kafka? It's now such a foundational technology. I was always curious why a company felt the necessity to build this thing that seems pretty generic, and it seems like everyone would have needed it.

**** · Yes. So I think Jay Kreps has a pretty good blog post from that era called "The Log", where he explains his motivation behind Kafka and why make it an append-only log rather than a traditional message queue or something like that. I think the motivation was really about data integration, because there were a whole bunch of databases and event-generating systems, activity events from users for example. They were all generating data in a stream shape, and then there were a bunch of downstream systems that wanted to consume this, wanted to get it into the data warehouse, and wanted to be able to get it into the Hadoop cluster at the time in order to run machine learning and things over it. There was just this data integration problem of how do you physically get the data out of one system and into another. Jay designed Kafka as this integration point, essentially almost the lowest common denominator but still a general-purpose abstraction for integrating various data sources and downstream data sinks.

**** · Working at LinkedIn on Kafka, and at LinkedIn scale, what did you learn, or what surprised you about working at this type of scale? As I understand, this was the first time that you worked hands-on on a really large system.

**** · Yes. Because previously the biggest company I had worked in was Rapportive, with five people. We had a sizable database, but it was still a single-instance database and not really that big in the grand scheme of things. And then suddenly I was at LinkedIn and, oh, we get to use their big Hadoop cluster.

**** · That was fun, hand-coding MapReduce jobs in Java at the time, and so I learned a huge amount there. Especially when the stream processing ideas came up and Jay was evangelizing the use of Kafka and the things you could do with it. That was a revelation for me, really, where I suddenly felt this makes sense. I started to understand how these various data systems fit together, what they have in common, what the fundamental principles are. And so that experience then fed directly into the writing of the book.
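The contrast between an append-only log and a traditional message queue can be shown with a small sketch. This is an illustration only and not Kafka's API: the `AppendOnlyLog` and `Consumer` classes, their method names, and the event names are all hypothetical. What it shows is that records are never removed when read, and each downstream consumer keeps its own offset, so several consumers can read the same data independently.

```python
class AppendOnlyLog:
    """Toy in-memory append-only log (hypothetical; not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset.

        Reading does not remove anything from the log.
        """
        return self._records[offset:offset + max_records]


class Consumer:
    """A downstream system that tracks its own position in the log."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        """Fetch the next batch and advance this consumer's offset only."""
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = AppendOnlyLog()
for event in ["page_view", "profile_update", "connection_added"]:
    log.append(event)

# Two independent consumers (names are illustrative) read the same records.
warehouse = Consumer(log)
hadoop = Consumer(log)

warehouse_batch = warehouse.poll()
hadoop_first = hadoop.poll(max_records=1)
hadoop_rest = hadoop.poll()
```

In a queue that deletes messages once they are delivered, the second consumer would not see what the first one had already taken. Here both consumers end up with all three events, each at its own pace.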

**** · At what point did you decide to leave LinkedIn? Looking through your career: start out in the UK, do a startup, do a second startup, Y Combinator, move to San Francisco, get acquired by LinkedIn. The arc that most people would draw would be, okay, do something more in Silicon Valley, or maybe start a second startup, etc. And instead you decided to leave LinkedIn.

**** · Yeah. So, first I decided to move back to the UK, and I continued working for LinkedIn remotely. That was mostly because my girlfriend at the time, now wife, was still in the UK, and a long-distance relationship is not a lot of fun, and I didn't feel that at home in the Bay Area. So I wasn't really encouraging her to move to the Bay Area either. I thought it was better for me to go back to Europe, and I'm very happy with that decision. I still have a lot of great friends in the Bay Area. I love it as a place to visit, but I wouldn't want to live here, honestly.

Writing Designing Data-Intensive Applications

**** · Then I was still remotely working for LinkedIn and that worked for a while. When I then started writing the book, LinkedIn even gave me 50% of my time free to work on my book alongside my software engineering duties, which is really great.

**** · Amazing. Yeah, that is so nice of them.

**** · Absolutely.

**** · And they don't have to do that. LinkedIn didn't directly get anything out of it in return, other than a book that they could use for internal training purposes.

**** · Well, shout out to LinkedIn for this.

**** · Yeah, absolutely. Though I did then find that trying to write a book in parallel with doing a software engineering job and being on call, etc., I just wasn't able to do it. It's just too much context switching, and it's very easy for the urgent things from the on-call to dominate, and then not to have the freedom that you need in order to write something new. So after a while I decided, okay, it's probably better if I focus full-time on the book.

**** · So I then left LinkedIn and just took a sabbatical, an unpaid sabbatical, i.e. unemployment, to just focus full-time on the book for a while. It's only after that that I even considered getting into academia.

**** · So how did the idea of the book come about? At what point did you decide you would write, and in your mind, what were you deciding to write? Was it already this book with this layout, or did you have an earlier idea back then?

**** · I had an idea. Of course the final product ended up looking somewhat different, but the overall goal, I think, stayed the same. I knew I wanted to write something that was a broad conceptual overview. So not about how you use any one specific system or tool, but comparing the trade-offs between many different types of tools.

**** · And I knew that I wanted it to be practitioner-focused: not a theoretical textbook, but something that people could use to build real systems. That was the goal with which I approached it.

**** · And this was exactly the book that I wish I had had when I was starting out and working at Rapportive, for example, because we were all searching around in the dark. We were having performance problems with our database and we had no idea what to do, because we were totally lacking the foundations to understand what was going on and how to diagnose the issues. And so I felt that if I had had a bit more background on how these data systems work internally, then I could have had an intuition about how to debug these kinds of performance issues. And then after a while, after I'd learned more about how data systems work, I thought, okay, it's time to write this down so that others don't have to learn it the hard way, but can hopefully just get a better idea of how these systems work and thus be better at managing their own data systems.

**** · To start with, how did you learn about, for example, how databases work? Because again, from your story, at Rapportive you built systems and you had some performance issues, at a smaller scale, to be fair, compared to LinkedIn. Then you worked at LinkedIn and you saw a little bit of how the sausage was made. But I know a lot of software engineers who have been on this path, and they still don't really know how the fundamental systems work. They just know, okay, we have a platform team inside our company and they build it. I could read the RFCs, but it's a lot of work, or the planning docs, or I could look at the source code. It feels to me that even at that point you just went down and tried to dig in. What resources did you use? How did you find out those basics which you later put into the book?

**** · A lot of it was just being curious and talking to people and just asking them lots of questions. At LinkedIn there were a bunch of senior data systems engineers who understood their stuff very well but hadn't necessarily written it down.

**** · Mhm.

**** · And so I just talked to a bunch of them and quizzed them, and that way started building an image in my own mind of how this stuff works. Once I got the basics from these conversations, I was able to go and read research papers, for example. They go into much more detail of exactly how and why things are designed in such a way. But it is time-consuming to read those things, so what I tried to do was pull out what are really the essential ideas. I just read a ton of blog posts as well.

**** · And so the reason why you see so many references at the end of each chapter in the book is that that is the material that I myself used in order to understand what was going on. And then I thought, okay, if I found these things useful, then I'll also cite them in the book, so that for any reader who wants to go beyond the basics covered in the book, here are some good sources for further reading.

**** · Yeah, the structure of the book, this first book at least: it's foundations of data systems, distributed data, and derived data. If I understood, these are the three big parts. Did you already have a structure in mind when you started writing the book, or did it take shape as you went?

**** · This three-part structure is not that critical in the design of the book, really. That's more after the fact: I thought, oh, well, it seems we can group the chapters into roughly this structure. But the topics of the chapters were more or less what I had envisaged. So I knew that I wanted to talk about what a transaction is. I knew that I wanted to talk about replication. I knew that I wanted to talk about sharding or partitioning.

**** · I knew that I wanted to talk about consistency and consensus. Those high-level topics, I think, were clear from my initial book proposal to the publisher. The details within each chapter are something that I often figured out once I got to that chapter. So I wrote one chapter at a time and started work on each chapter with just a lot of background research to get up to speed on the topic myself. And it's often only then that, say, for replication I decided, okay, it seems the three major ways of doing this are single-leader, multi-leader, or leaderless.

**** · Okay.

**** · Mhm.

**** · I would decide on that structure essentially when I started writing each chapter, and then try to fit the various points I wanted to make into this narrative structure.

**** · As a fellow author who also wrote a book, one thing I've noticed is that there are parallels between estimating a book and estimating a software project, in that you come in with an estimate, and if you've never done it before you tend to be wildly off. How was this in your journey? In addition, you also had a publisher, and publishers are a little bit like project managers. They like to have a schedule. They like to try to keep you on track. They like to ask, when is it done?

**** · How did you manage that part as well?

**** · And in the end, how long did you estimate it would take when you started and how long did it take?

**** · As always, it takes vastly longer than expected. It's the same for software projects as it is for writing, I think.

**** · So I think it took me about four years to write the first edition, and that was not four years of full-time work. Maybe two and a half years of full-time equivalent or something like that, but written over the course of about four years. So it definitely took a long time. The publisher deadline I missed by a ludicrous margin. I think I missed it by about two and a half years or something like that. But fortunately O'Reilly were pretty laid-back with the first edition and were happy for me to just take my time and make it good. When it came to the second edition, O'Reilly got a bit more aggressive and pushy about sticking to deadlines. I guess by that point the book had been established and people were waiting eagerly for the second edition. So I understand the desire to want to accelerate it, but at the same time, I really appreciated the freedom that I had for the first edition to work on my own schedule, and I had a bit less of that with the second.

**** · The tagline for the first edition, which I believe is the same for the second edition, is "the big ideas behind reliable, scalable, and maintainable systems." Reliable, scalable, and maintainable. What do these objectives mean to you?

Reliability, scalability, and maintainability

**** · Yeah.

**** · So they're all slightly vaguely defined. There's not a formal definition of those things. But for me, reliability means fault tolerance primarily, meaning that a system should on the whole continue working even if a network link is interrupted or a node crashes or something like that. So a lot of the book is about techniques that support fault tolerance, replication for example. So that's reliability. Scalability is one of those terms that gets thrown around a lot, and it's fashionable and cool to make things scalable, because it suggests success and millions of users. So of course everyone wants things to be scalable, because everyone wants success. For this book, we tried to take a bit more of a dispassionate approach and said scalability is just what mechanisms we have for dealing with changes in load. If load increases, how can we add computing capacity to a system, for example, so that the system still continues working? And then the techniques that you use to achieve scalability, well, they are sharding, for example.

**** · But in this case, with your definition of scalability, do I understand that you're mostly referring to horizontal scalability, so you can scale compute up or down, pretty much?

**** · Yeah, I guess, because that's the more interesting one. Yes, you can always buy a bigger machine, and what's interesting about that? There's just not that much to be said about it. There are details of how you scale even on a single machine, but I think part of what has become interesting about modern cloud services, and just backend services in general, is how they've introduced this idea of horizontal scalability and shared-nothing systems. So we can build systems that are able to cope with very high load even if the individual components are just fairly cheap commodity machines. But maybe part of the scalability story which I wasn't thinking about as much at the time, but started thinking about more recently, is not just scaling up but scaling down as well.

**** · So how do you run a service in such a way that if it has a very small amount of load, it's really cheap to run? That's in a way the same question as how do you continue running a service if it has very high load. Generally you just want the cost and the computing capacity to be roughly proportional to the load that you have. And at the low end, that means being able to scale down to something that is extremely cheap to run. That's not necessarily a given. That's something that is hard with on-premises software, for example, because if you've got a physical machine, that's a unit of deployment, and yes, you could carve it up into two dozen virtual machines and make those small virtual machines, but it still requires some resource allocation. So part of what's interesting about some serverless systems, for example, is their ability to scale down and say, okay, if you're going to handle just three requests per day, that's just fine as well.

**** · Can you tell me about the second edition? When did the idea come about?

**** · Yeah, it had been clear for a couple of years that the second edition was needed, just because the first edition was getting a bit dated. There were changes in technology that just hadn't been reflected in the first edition. So I wanted to update it, but I now have an academic job. Doing research and teaching is my main thing, and updating the book is just a sideline in some sense. So it took quite a while to make progress with that, because I was always doing it alongside other projects. It was essentially back to that context-switching problem that I had while writing the first edition, but now with an academic job that I didn't want to just drop, because I quite enjoy it. So initially I made very slow progress with the second edition, and I also realized that I had slightly lost touch with current industry practices because I'd switched over to the academic side.

DDIA: the second edition

**** · I had gone much deeper on the theory, but I was no longer up to speed on what people were doing with, say, data lakes or things like that. So then at some point I remembered Chris Riccomini, an old colleague from LinkedIn. I had worked with him on the stream processing stuff.

**** · You worked with him? He's the author of The Missing README.

**** · Exactly.

**** · Wow. What a small world.

**** · Yeah.

**** · And I had read Chris's book, The Missing README, and thought, "Oh, he's a great writer." I had worked with him as a software engineer and found him a great colleague, and he had also been writing this newsletter called Materialized View on the latest trends in data systems, essentially, and had become a startup investor in that space. So at some point I thought, well, I have to get in touch with Chris and ask him whether he wants to help out with the second edition. And he was keen to do that. That turned into such a good collaboration, because he was up to date on what the cutting edge was in terms of technology in industry.

**** · I had strong opinions on how to teach, essentially. So how to explain things in the book, making sure that we were explaining everything in a way that was very precise, with very carefully chosen words, but at the same time very accessible, so that it's hopefully easy to read. And so we took essentially my writing style plus Chris's knowledge of the latest industry trends to bring the book up to date, and that was a great collaboration.

**** · What are the big things that you added? Which of these did you know would be missing, and which did you realize during the writing process, that okay, this needs to be in here now?

**** · Yeah, so the thing we knew from the start that we wanted to reflect was cloud-native systems architecture. It's a bit of a vague term, but what we mean by that is essentially building data systems on top of cloud services as the foundational abstraction. In the first edition the assumption was that you have some machines.

**** · Each machine has some local disks. You can run a database instance on a machine. It will write its data to the local disk. If you want to replicate it to another machine, then the database software will replicate it at the database level to another machine, which will also write the data to its local disks. For a long time that was exactly the way computers worked. And now suddenly people are building databases on top of object stores, for example. Now the replication happens at the object store level, no longer at the database level. Or maybe there's still some replication at the database level, but it really changes the nature of things if you're building on top of an object store. This is different from, say, building on top of a virtual block device like EBS, because these block devices, although they are cloud services, still offer the abstraction that is a single-node operating system abstraction: a block device on top of which you run a file system. Whereas an object store is just a brand new abstraction. It looks different from a file system, and it behaves differently.

**** · And so building on top of that as a foundational abstraction is something that people were starting to do at the time of the first edition, but since the first edition that has really taken off. A whole lot of systems have been built in that style now. That's an idea that we really wanted to incorporate, and we weaved that in throughout the book. So it's not just one section here; it's an idea that we've integrated throughout the entire narrative.
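The shift described here, from a database that writes to local disks to one that treats an object store as its storage layer, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not how any particular database or cloud service works: `ObjectStore` is an in-memory stand-in with a hypothetical `put`/`get`/`list_keys` interface, and `SegmentedKV` is a made-up toy store. The point is only that objects are written whole and never updated in place, and that a second node with no local state can read the same data because durability lives in the shared store.

```python
import json


class ObjectStore:
    """In-memory stand-in for a cloud object store (hypothetical interface).

    Objects are written and read whole, by key; there is no in-place update.
    In a real service, durability and replication would be handled here.
    """

    def __init__(self):
        self._objects = {}

    def put(self, key, data: bytes):
        self._objects[key] = data

    def get(self, key) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


class SegmentedKV:
    """Toy key-value store that flushes immutable segments to the object store."""

    def __init__(self, store, prefix="segments/"):
        self._store = store
        self._prefix = prefix
        self._buffer = {}  # recent writes not yet flushed
        self._next_segment = len(store.list_keys(prefix))

    def set(self, key, value):
        self._buffer[key] = value

    def flush(self):
        """Write buffered entries as one new immutable object."""
        if not self._buffer:
            return
        name = f"{self._prefix}{self._next_segment:08d}.json"
        self._store.put(name, json.dumps(self._buffer).encode("utf-8"))
        self._next_segment += 1
        self._buffer = {}

    def get(self, key):
        """Check unflushed writes first, then segments from newest to oldest."""
        if key in self._buffer:
            return self._buffer[key]
        for name in reversed(self._store.list_keys(self._prefix)):
            segment = json.loads(self._store.get(name))
            if key in segment:
                return segment[key]
        return None


store = ObjectStore()
db = SegmentedKV(store)
db.set("user:1", "Ada")
db.flush()
db.set("user:1", "Ada Lovelace")
db.flush()

# A second node with no local state reads the same data from the shared store.
replica = SegmentedKV(store)
```

Updating a value produces a new segment rather than modifying an old one, which is one way a design can fit the whole-object interface. A block device, by contrast, would let the database overwrite individual blocks through a file system.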

**** · There are now a lot of managed services as well. There are the primitives that we use, but there are also so many managed services that all the cloud providers use, and a lot of engineers often just use the managed services as is, because they take care of replication. They have SLAs for uptime and so on. But when you build on top of these things and you use those as primitives as well, is there any risk as a software engineer that you're no longer incentivized to understand the underlying layer, or are we building better systems because of that? How do you think about this? It feels like there's a move of abstraction because of cloud.

**** · Yeah, it's definitely a shift to different and higher-level abstractions, but that's been the story of the entire computing industry since the start. It's building new abstractions. So it is true that if you rely on a higher-level abstraction, you're no longer thinking about the lower-level details. It's like if you're using a programming language with a garbage collector, you're no longer thinking about memory allocation. And so is that a loss? Well, maybe. If you're building low-level systems, you still have to care about memory allocation. If you're building higher-level business logic, I think it's just fine for people not to care about memory management. So I think there's an analogous thing here with data systems: if you're building the higher-level systems that don't need to particularly care about the underlying infrastructure, then that's fine. Just use the higher-level abstractions.

Tradeoffs of using cloud services

**** · Nothing wrong with that. But somebody still has to build those lower level abstractions from lower level components. Somebody's got to implement the cloud services.

**** · And so those people will have to then specialize even more in the details of how you engineer those cloud services, how you make them reliable, how you operate them and so on. The skills are still there. It's just a bit of specialization happening that some people can worry about the higher level things without having to concern themselves with the lower level things.

**** · Some people focus on the lower level things and treat that higher level aspect as their customers.

**** · Interesting. So it sounds to me that if you're an engineer who is utilizing a lot of these services, you might not need to know how they exactly work.

**** · Yes.

**** · And I would say the underlying philosophy of the entire book is to give people insight into just the essence of how the systems work internally, so that if, for example, they start seeing weird performance behavior, they can have a bit of intuition for why it's doing that and how they might solve it. For example, the storage engine chapter tells you how B-trees work and how log-structured (LSM-tree) storage engines work. The book is not intended for people who are going to build their own databases and implement their own storage engines. If you want to do that, you have to go into much greater depth than this book covers. But the idea is that, as an app developer, if you know just a little bit about how the storage engine works internally, you'll be in a much better place to use it in a way that gives you good performance, for example, and to diagnose any issues. We've kept that philosophy in the context of cloud services too: yes, a cloud service hides some of the operational details that app developers don't need to think about anymore, but they should still know a bit about how it works internally, just so that they can use it effectively.

**** · I guess to argue about the trade-offs: deciding on which service to use, which characteristics to look out for.

**** · Yeah.

**** · For your use case.

**** · Exactly.

**** · And there are huge differences, say, if you're doing analytics, depending on whether you're using row-oriented storage or column-oriented storage. That's a bit of a technical distinction, and it takes a little bit of background reading to even understand what it means, but it has a massive performance implication for the final behavior of the system. Those are the places where I feel knowing a bit about the internals is a superpower.

**** · Yeah. And I guess the one thing that we engineers always need to argue about, or should need to argue about, is at the very least cost versus performance. By performance I mean latency to the user and, of course, resilience: if something happens, a machine goes down, a zone goes down, a region goes down, how our product is affected and what's acceptable. The basic idea there seems to be how much availability risk you are willing to take on versus the overheads: the computational overheads of the system itself, but also the human overheads of designing and operating the system, and the cost overhead.
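To make the row-oriented versus column-oriented distinction mentioned above a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the table, its field names, and the helper functions are all hypothetical, and plain in-memory lists stand in for what a real storage engine lays out on disk.

```python
# Illustrative data only: the field names and values are made up for this sketch.

# Row-oriented layout: all fields of one record sit together.
rows = [
    {"user_id": 1, "country": "DE", "amount": 20.0},
    {"user_id": 2, "country": "UK", "amount": 35.5},
    {"user_id": 3, "country": "DE", "amount": 12.5},
]

# Column-oriented layout: all values of one field sit together.
columns = {
    "user_id": [1, 2, 3],
    "country": ["DE", "UK", "DE"],
    "amount": [20.0, 35.5, 12.5],
}


def total_amount_rows(records):
    """Analytic query on rows: every whole record is visited to read one field."""
    return sum(record["amount"] for record in records)


def total_amount_columns(table):
    """Same query on columns: only the 'amount' column is read."""
    return sum(table["amount"])


def fetch_user_rows(records, user_id):
    """Point lookup on rows: one record holds everything about the user."""
    return next(record for record in records if record["user_id"] == user_id)


def fetch_user_columns(table, user_id):
    """Point lookup on columns: the record is reassembled from every column."""
    index = table["user_id"].index(user_id)
    return {name: values[index] for name, values in table.items()}
```

Both layouts return the same answers; what differs is how much data each query has to touch. The aggregate reads a single column in the columnar layout, while the point lookup is more natural in the row layout.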

**** · Yeah, exactly. You can have a system that is more able to tolerate various types of faults but which is more expensive to design and operate, versus a simpler system that might go down a bit more often but which is cheaper. There's no right and wrong with that. Everyone needs to figure out for themselves where they sit in that trade-off space. I would say that multi-region is pushing in the direction of higher availability, because it means you could tolerate the outage of an entire region. But then it has implications for the consistency model that you can get across different regions, for example. That's a trade-off that the book tries to make very explicit, to help people reason through what the right choice is for them. In terms of multi-cloud, for example, one thing that I've been concerned about, just in the last month really, is European dependence on US cloud services.

**** · Yes.

**** · So what if geopolitics was to go horribly wrong, tensions escalate, and Europe finds itself suddenly locked out of US cloud services? I hope that doesn't happen. I still think it's fairly unlikely, but it's no longer unthinkable. As a result, coming from this European perspective, I have been thinking a fair bit about how we can engineer systems to be resilient against that kind of thing. That's not just a regional outage; it's essentially a business risk, and a multi-cloud setup could help mitigate that risk, so that if one company locks you out, for example, you could still have systems on another company. Again, that's very much towards the expensive but high-availability, risk-reduction end of the spectrum. But for the people who have really critical workloads, where they think this geopolitical risk is significant enough, I think it's seriously worth considering that kind of setup.

**** · I'm thinking that as engineers we do have the responsibility, because who else will do this?

**** · Yes, totally. But I totally agree with you as well that understanding what the risks are and communicating what the trade-offs are is going to be a core part of our role as engineers moving forward. Maybe as AI writes more and more of our code, it's less about the details of how you express logic in a particular programming language and much more about those kinds of high-level trade-offs.

**** · How has the definition of scale changed in this book? Before cloud, building a scalable system sounded pretty involved: building a horizontally scalable system is complicated, with all the pieces you need to put in place, and in the first book you detail a lot of this. With cloud, a lot of the services define how they allow horizontal scaling and what the trade-offs are. Do you feel that it has become a lot easier to reason about scalability when you are using these primitives?

**** · I think achieving really high scale is still challenging. We have cloud services, object storage for example, which provide you with this very elastic storage model, so at least you don't have to worry about capacity planning on your disks anymore or about running out of disk space, because those kinds of operational things are taken care of. But if you need sharding, for example, that's something that does reflect on the application code as well; you can't really make that entirely transparent. So if you're at a sufficiently large scale that sharding is required, because a single machine is not powerful enough to process your workload, then I think even with cloud systems you still have to do quite a bit of engineering thinking about how to realize that. Where I think the cloud has helped quite a bit is at the lower end, scaling down. If you want to have a very lightweight service that processes only a small number of requests, what we've got with serverless systems, being able to very quickly spin up and spin down a very lightweight instance, is quite a good innovation that has enabled those very low-scale services. That's something that would be much harder to do without cloud services, because you would have to statically allocate a certain amount of memory and certain CPU resources to a particular virtual machine.

**** · I love serverless. I have a small website that runs on serverless and my bill is 13 cents per month, because it has very little load.

How the cloud changed scaling

**** · Absolutely.

**** · It's just making more efficient use of computational resources.

**** · Let's talk about sharding. When you wrote the first book, I was working at Uber, and we talked a lot about sharding. There were a lot of internal implementations, and interviews involved asking about sharding, because we were designing systems that were sharded. I did sense that over time, as cloud systems became available that give you turnkey solutions and act more like platforms, where you send the data and it takes care of these things, fewer engineers have to implement sharding with cloud-native systems. In your research, what have you seen? What are the cases where putting sharding in place is still important, and where are the places where it might have just disappeared as a concern?

**** · It's still nice to know, but you might not have to implement it.

**** · I think it's probably less of an effect of cloud and more of just hardware getting more powerful. A big machine nowadays can do a lot, and that means that more and more workloads can just run on a single machine, and that is sufficient to achieve quite significant scale already. There are still concerns about how to efficiently make use of the hundreds of CPU cores that you have on a single machine, so parallelism is still a required thing to think about, and sharding is one way of achieving parallelism. But sharding across multiple machines has maybe become less of a pressing issue, just because more and more workloads can run on a single machine. Some people still have very large-scale workloads that do have to be sharded across multiple machines, so it's not going away entirely. And replication is still relevant even at smaller scales, because that's for fault tolerance, not for scalability.
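As a rough illustration of why sharding "does reflect on the application code", here is a minimal Python sketch of hash-based shard routing. Everything in it is an assumption made for illustration: `shard_for` and `ShardedStore` are hypothetical names, and each in-memory dict merely stands in for a separate machine.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store; each dict stands in for one machine."""

    def __init__(self, num_shards: int):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        # Every write must first decide which shard owns the key.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Every read must route to the same shard the write went to.
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

A possible usage: `store = ShardedStore(4)`, then `store.put("user:42", "Ada")` and `store.get("user:42")`. Note that with simple modulo routing like this, changing the number of shards moves most keys to a different shard, which is one reason real systems use more elaborate schemes.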

**** · You have a chapter called "The Trouble with Distributed Systems", which goes through a lot of things that can go wrong. Without going through the whole chapter, can you recall some of the things that are memorable to you, or some of the things that you feel are important to remember?

**** · Yeah. The whole idea of this chapter is that in distributed systems theory there are certain things that we tend to assume.

The trouble with distributed systems

**** · For example, we just assume that there's no upper bound on how long it might take for a message to go over the network. You send a message, and it might arrive within 100 microseconds or it might take 10 years, and distributed systems theory just doesn't make any assumptions about that timing if we can avoid it. Or rather, some theory does make those assumptions, but it's a dangerous assumption to make, because occasionally the network delay does become much higher than what is typical. Another thing is about crashes. For example, distributed systems theory just says nodes can crash, but what does that mean?

**** · What in practice does it mean for a node to become unavailable? It might be a software crash, but it might also be a hardware failure. It might be somebody unplugging the power cable. It might be that the node is still running but has just become disconnected from the network. The point of this chapter really is to defend and justify those theoretical models that we use for analyzing distributed systems, by giving a lot of stories and case studies that show that tons of stuff does go wrong. Don't believe anyone who says, "Oh, failures are rare, don't worry about it, it's fine." The moral of this chapter is really that, if you want to make things reliable, you do have to worry about a whole bunch of weird, unusual, but certainly possible edge cases. Timing is another one of those things. It's very easy to assume that your clocks are correct, and most of the time the clocks are pretty correct, but we just can't rely on it, because on the whole they're not precise enough. So a lot of it is about this: it's very tempting to make certain assumptions that things are well behaved, and in distributed systems we just have to try to get away from those assumptions if we want the systems to work reliably even in the face of things going wrong. But it was a really fun chapter to write, because it's essentially a big collection of stuff that has gone wrong. I went through a bunch of postmortems published by various tech companies, for example, in order to see what the root cause was of how things went wrong, and what lessons we can draw from this that apply to the book in general.

**** · And there's some fun stuff, like the sharks biting undersea cables and damaging them, which just makes for a great story. And then I hear that in recent years the shielding of undersea cables has got better, and therefore the sharks are not biting them anymore. But instead, the cows on land are stepping on cables and occasionally causing network interruptions that way. That kind of thing just makes it a bit more fun.

**** · That chapter is so interesting also because it depends on what teams you work on or what people you talk with. When I talk with the S3 team, for them that whole chapter is just their day-to-day.

**** · For them it's not a weird thing when a hard drive fails. OK, it might be a weird thing to have a fire in a data center, but they're prepared for all of those things. They're at the scale where these things just happen on a regular cadence, because they operate at one of the largest scales. Whereas at a smaller company, even if you read this chapter, you will treat it as: well, this could happen, but when it happens it will be a once-in-ten-years event and it will be a big deal.

**** · Yeah. But I think there's no right answer. It's a trade-off between risk and cost, broadly speaking. That means a business decision has to be made about where the business wants to lie on that trade-off. So the goal of this chapter is really just to give people the information in order to make an educated decision. But I don't want to make that decision for people. That's for businesses themselves to decide.

**** · That's very clear. Have you come across some concepts or systems mentioned in the book, in the first edition and now in the second edition, that are becoming either more popular or less popular over time, more or less referenced by your readers? I'm thinking about things like streaming systems, batch processing, or anything else.

**** · Yeah. There are some things that we've been able to take out of the book compared to the first edition. In particular, coverage of MapReduce was quite detailed in the first edition, but MapReduce is dead; nobody uses it anymore. Its successors, in the form of Spark and Flink for example, are used, and so we still reference MapReduce in the second edition, but more as a learning tool in order to understand how these partitioned, sharded batch processing systems work.

**** · So that's one thing where we've been able to reduce the coverage. But other areas where we've increased the coverage are, for example, systems in support of AI. Even though this is not an AI book, there are still data systems concerns that arise when needing to support AI applications. A classic one is vector indexes, so we've added some coverage of vector indexes to the storage engine chapter. It fits in really well there, because that chapter already covers various indexing strategies anyway, and a vector index is just another indexing strategy. We also added some coverage of data frames, for example. That's not an exclusively AI thing, but data frames are quite a good data representation for training data. That was not one of the data models that we discussed in the first edition, but we decided to add it to the second edition because it has become a very important data model that people are using alongside all of the classic data models like relational, graph, JSON documents, and so on. So there are these places where we've just expanded the coverage a bit to reflect the kinds of systems people are building, for example to support AI, without changing the direction of the book entirely.

**** · In the first edition, the final subsection, or I guess the final few subparts, were titled "Doing the Right Thing", and in the second edition this has its own chapter.
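To give a feel for the query a vector index answers, here is a minimal Python sketch of nearest-neighbor search over embedding vectors. It is a brute-force scan, not an index: it compares the query against every stored vector, which is exactly the cost a real vector index (typically an approximate structure) is built to avoid. The function names and the tiny two-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=1):
    """Return the names of the k stored vectors most similar to the query.

    Brute force: scores every stored vector, then sorts by similarity.
    """
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}
```

For example, `nearest([1.0, 0.0], embeddings, k=2)` ranks "cat" first and "dog" second, with "car" far behind.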

Ethics for software engineers

**** · The final chapter is "Doing the Right Thing", and I'll quote a little bit from it: "We, the engineers building these systems, have a responsibility to carefully consider those consequences and to consciously decide what kind of world we want to live in." Can we talk a little bit about this section and the importance of it?

**** · Absolutely.

**** · Yeah. So the motivation for putting in an ethics section there in the first edition was that I just felt it had been quite ignored as a concern during my time in industry. Especially in startups, people were very focused on building a product that their customers would love, and really deprioritized these ethical questions in the process.

**** · For example, with consumer-facing products, it might be that the products are very much geared towards essentially data harvesting, collecting behavioral data, because that's what can be monetized in the form of advertising, and there seemed to be very little reflection on what was good and bad about these things. So I really just wanted to encourage a bit of thinking there. I didn't want to prescribe a particular approach too much, but at least to point out that there is such a thing as data protection legislation now, which we do have to think about in the architecture of our data systems, and there is an ethical responsibility. People say that you get into tech in order to change the world. If you want to change the world, then thinking about the impact that your technologies have on the world is part of your job. It's a really essential part, and something that engineers are often prone to ignoring as we focus just on the technology and less on the effects that technology will have out in the real world. So this chapter is really just an attempt to get people thinking about it a bit. And it's a reflection of my own process as well, because when I started working on these systems, I didn't really think about ethical things particularly either. So I felt I had to put that section in there for myself as well as for the readers, because it was my own way of grappling with these questions a bit.

**** · Is it fair to say that as engineers building these systems, which will have an impact on a wide range of things, potentially a society-wide impact, we are in such a good position to directly influence and maybe even change course? Do I understand that this section is a bit of a reminder that by building these systems we have a huge opportunity to shape them, and that we probably have much stronger voices, maybe as strong as the regulator might have years down the road?

**** · Exactly.

**** · I think engineers have a very strong voice there and, as we talked about earlier, engineers need to articulate trade-offs in such a way that business leaders can then make educated decisions about how to address those trade-offs. Part of those trade-offs is pointing out risks. And risks include not just technical risks, like the data might get corrupted, but societal risks as well. For example: what negative effects and what harms might arise from this technology, what unintended consequences, or what risk of reputational damage there is if it turns out that a technology has some harmful effects. That can reflect badly on the company that made it, and that has to be part of the trade-off discussion. I just want people to make intentional and deliberate decisions about these things and not just sweep them under the carpet.

**** · One of the hot topics these days is of course AI, and you've written a very interesting post about this, just in December, about formal verification and your conviction that formal verification might become more important with AI. For those of us who have only heard of formal verification, can we talk about what it is and how you envision it becoming more important?

**** · Yeah. So there's a whole range of formal methods.

Formal verification

**** · One approach is, for example, to use a specification language like FizzBee or TLA+ or something like that to describe the expected behavior of a system at a high level, and then use a model checker, which is essentially a randomized test case generator, to just play through a lot of scenarios and see whether the system has those desired behaviors in all the different scenarios. That's the intro level of formal verification. I would say the more advanced level is to use actual formal proof. In that case you write a specification of some system in a formal language, usually using mathematical notation, and then make a mathematical proof that a certain algorithm or a certain implementation always satisfies that specification. The distinction from testing is that in testing you just try a couple of examples: you give the algorithm some example inputs and check whether you get the expected output in those particular examples. But a proof can reason about potentially infinite state spaces. So it can tell you things about every possible thing that could possibly happen in the entire universe, and show that, for example, a certain safety property always holds in all of those cases. Formal verification is a lot of work.
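The idea of playing through many scenarios and checking a property in each can be sketched in a few lines of Python. This is not how TLA+ or FizzBee are used, and it is not taken from the conversation; it is a toy, hypothetical example that enumerates every interleaving of two processes that each increment a shared counter in two non-atomic steps (read, then write), and checks the property "after both increments the counter is 2".

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of read/write steps against a shared counter."""
    counter = 0
    local = {}  # each process's private copy of the value it read
    for process, step in schedule:
        if step == "read":
            local[process] = counter
        else:  # "write": store the previously read value plus one
            counter = local[process] + 1
    return counter


def find_violations():
    """Check 'two increments leave the counter at 2' in every schedule."""
    p0 = [("p0", "read"), ("p0", "write")]
    p1 = [("p1", "read"), ("p1", "write")]
    schedules = list(interleavings(p0, p1))
    return schedules, [s for s in schedules if run(s) != 2]
```

Calling `find_violations()` explores all 6 possible schedules and reports the 4 in which both reads happen before either write, so one increment is lost. A hand-written test that only runs the two processes one after the other would never see the problem.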

**** · I never used it in my time in industry because it's just too time-consuming. I only got into formal verification when I was in academia and could afford to take the time to spend a few months proving an algorithm correct. But there I started finding it very useful, especially when working on very subtle algorithms, where it's very hard to tell just from reading the implementation whether it is correct in all possible cases.

**** · But if it's an important algorithm where, for example, it will corrupt data if there's a mistake in it, or it will have a security vulnerability if there's a mistake in it, when it's high-stakes things like that, then I feel it's worthwhile to have formal verification and to make sure that the code really is correct. So I've done some formal proofs using the Isabelle proof assistant, for example. There are a couple of others as well: Rocq, Lean, and so on.

**** · These proofs are really hard to write.

**** · It takes a long time to learn the language of writing those proofs. And then even once you know the language, it's just really laborious to write the individual proof steps.

**** · When you say it's hard to write, as someone who knows how to code in so many different languages, can you explain what it means for it to be hard to write? Does it feel like a strict programming language with all sorts of rules, or like lots of math formulas?

**** · What makes it hard for you to learn it and get good at it?

**** · Yeah. So, you're trying to make a proof that a certain piece of code always satisfies a certain property. In some cases, that property might be quite easy to specify. Let's say, as a really simple example, you have two lists and you want to concatenate them, and you want to prove that the length of the concatenated list equals the sum of the lengths of the two individual lists. A very simple property. How would you prove something like this? Well, you would have a function that concatenates two lists, and then you would probably do a proof by induction over one of the lists. It shows that if you have one list of length i and another list of length zero, then the sum of the two is i. If you have a list of length i appended with a list of length one, then it's i + 1, and so on. By using a proof by induction, you can then show that the length of the concatenated list is i + j, where i and j are the lengths of the two input lists, for every possible value of i and j. In tests, you would maybe test it for the cases of j equals 0, j equals 1, and j equals 5, and then you're done.

**** · And j equals int max.

**** · Yes, the edge case.
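For readers curious what such a proof looks like when written down, here is a small sketch in Lean 4 of the list concatenation property described above. It is an illustration under assumptions, not the speaker's own proof: `concat` and `length_concat` are hypothetical names defined only for this example, the induction runs over the first list (because `concat` recurses on its first argument) rather than the second, and it assumes a recent Lean 4 toolchain in which the `omega` tactic is available. The standard library already provides this fact for the built-in `++` as `List.length_append`.

```lean
-- Hypothetical definition for illustration: concatenate two lists by
-- recursion on the first one.
def concat {α : Type} : List α → List α → List α
  | [],      ys => ys
  | x :: xs, ys => x :: concat xs ys

-- The property from the discussion: the length of the concatenation is the
-- sum of the two lengths, for all lists of any length.
theorem length_concat {α : Type} (xs ys : List α) :
    (concat xs ys).length = xs.length + ys.length := by
  induction xs with
  | nil =>
    -- Base case: concat [] ys = ys, and 0 + ys.length = ys.length.
    simp [concat]
  | cons x xs ih =>
    -- Inductive step: unfold one step of concat, apply the induction
    -- hypothesis, and let omega finish any remaining linear arithmetic.
    simp [concat, ih] <;> omega
```

Unlike the handful of test cases mentioned in the conversation, the theorem covers every pair of input lists at once.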

**** · That's what we do. That's how I write my unit tests.

**** · Exactly. And this is a trivial example, list concatenation. You can easily just read the code and convince yourself that it's correct. But if it's a much more complex algorithm, then our brains just can't grok the algorithm well enough to really convince ourselves that it's correct if we don't prove it. That's where these proofs become handy.

**** · If I'm an engineer and I'm interested in getting started with formal verification, for example because I have the notion that it will be more important with AI and, of course, that it will be easier to write these things, where would you point engineers to get started? Or how did you get started in this field?

**** · I would suggest starting with model checking. Something like TLA+ or FizzBee is much friendlier to get started with than proof assistants like Isabelle, Rocq, and Lean. These proof assistants just require a whole lot of additional knowledge, and the resources for learning about writing these formal proofs are, to be honest, not particularly good. I haven't found really great books on it either.

**** · The way I learned it was by working with some colleagues in my lab who had learned it through years of prior experience and I just sat down with them and paired with them at a desk where I described the thing I was trying to prove and they showed me how to prove it step by step how to break it down.

**** · I'm interested to see if what you're thinking will be correct, which is that this kind of thing will go more mainstream, and hopefully we'll have better books and resources for it as well.

**** · Yes, I do hope so. The reason I believe that formal verification could become more important in the future has several aspects to it. One is that LLMs are getting increasingly good at writing these proofs, and if we don't have to write the proofs by hand as humans, it just becomes feasible to do them in situations where previously it would not have been economical. But also, LLMs increase the need for these formal proofs, because we're vibe coding a bunch of stuff. If we have to manually review all of that code, then that will become the bottleneck.

**** · So we can't really have humans reviewing all of the generated code either, if we really want to get the benefits of AI. We need some automated way of checking whether the code is correct. Writing lots of tests is a very good starting point. But the thing that a proof can do that tests can't is to consider absolutely every possible thing that could happen. That's really important in a security context, for example, where it just takes one little bug to create a vulnerability that destroys the security of the whole system. So I feel that for those domains where we really want to ensure there's a complete absence of bugs, those are the places where formal verification can really shine.

**** · And I'm hoping that LLMs will make that a lot more accessible to people who would previously not have considered using formal verification because it was just too hard and too expensive.

**** · You've worked in the industry and then you went into academia. Can you tell us what the difference is between the two? Myself and most people watching work in what you would call industry: the tech industry, working at different companies, bootstrapping our own, or just building our own things. How does academia contrast with this? What do you and your colleagues do inside of academia?

**** · Yeah, within academia there are lots of different styles, really. There's not one thing. Some people go full-on theoretical and mathematical, don't care about the real world at all, and just want to work on things that are intellectually interesting. And that's fine. And some people are very much at the applied end, wanting to do research that is likely to have a real-world impact. I'm more on the applied end. And that's fine too. But a common distinction is that academia can just think much longer term. If you're doing a startup, you have to ship something within a few months.

Academia vs. industry

**** · You can't afford to think 10 years into the future. Well, maybe you'll have a long-term vision that you're gradually getting towards, but you do have to ship things on a fairly short time scale. At a bigger company, maybe if you're working on infrastructure or so, you can think on a bit of a longer time scale, because the requirements are perhaps better understood. In that case, you're making sure that the system is scalable, operationally robust, and so on. It's then fairly clear what the requirements are, and it's still a matter of implementing it, but you can think a bit longer term. But in academia, what I really appreciate is the freedom to work on things that are long-term and which are not immediately commercially viable, or which are not aligned with the incentives of commercial companies. One research area that I've been working on for several years now is what we call local-first software, which is this idea that we want to take away a bit of the power from cloud operators and give it back to end users. End users should be more in control of their own data and less dependent on cloud services for providing the applications and the data that they need. And that's something that doesn't naturally come to companies, right? Because with software-as-a-service businesses, for example, the whole reason why they can charge a subscription is that they are able to essentially hold a gun to the customer's head and say, "Pay us your subscription, otherwise we will delete all your data."

**** · And I totally understand the commercial imperatives that lead to that, but it also leads to this situation where the people have a gun against their head all of the time.

**** · That isn't really a healthy situation to be in, in my opinion. But changing that, in such a way as to take away that gun from customers' heads, is difficult if you're in a business whose revenue depends on perpetuating that lock-in situation. And there I feel that in academia I have the freedom to work on things that go against this commercial incentive of companies and say: no, I'm going to do what I think is right for the users, and the commercial model of the companies making the software is second priority. I can afford to do that because I'm not dependent on this commercial model.

**** · To add to this, it's very interesting and challenging engineering problems.

**** · Yes.

**** · And it's wonderful to get to work on interesting engineering and computer science problems while at the same time trying to pursue this higher-level vision for local-first software.

**** · What are some of these really interesting engineering challenges that we need to solve to get to more viable local-first software? Be that, let's say, note-taking, which is a very popular one, right?

Local-first software

**** · Yeah.

**** · So with our vision of local-first software, we're trying to get away from this dependency on centralized cloud services. There may still be cloud services involved in syncing data between, say, your phone and your laptop, because often going via a cloud service is just the most convenient way of establishing that communication.

**** · But we just don't want to have to rely on one cloud service providing a particular function. Then if you can get away from assuming this one cloud service, you could, for example, have multiple cloud services on multiple cloud providers side by side, and you just sync with whichever happens to respond first, or sync with all of them, and then if one of them disappears, no problem, because you've got the other one. And so it gives us a huge amount of freedom and flexibility if we get away from this assumption of centralized cloud services. But that introduces a whole bunch of interesting research and engineering challenges. One thing that we've been working on lately, say, is access control. It's a simple problem: you have a document, you want to be able to grant collaborators access, and you want to be able to revoke that access. Again, totally obvious; it should be totally straightforward. In a centralized cloud service model it is totally straightforward because you have the roles, you check for the roles, and that's it.

**** · Yeah.

**** · But if you want to run your system over multiple providers or even in a peer-to-peer setting, then what could happen is that a user gets their edit permissions revoked and concurrently that user makes an edit to the document whose permissions have just changed. Now some devices may see the edit to the document first and the revocation second, and so they would accept the edit to the document. Another device may see it the other way around: it may see the revocation first and the edit to the document second, and it will drop the edit because it thinks it's not authorized. And now those devices have become inconsistent with each other, permanently inconsistent. So that means if we want to ensure consistency even for this fairly basic setup, we now have to somehow figure out how to resolve this situation of an edit that is concurrent with the revocation of the user who made that edit. Solving that problem then means doing so in a decentralized setting where we don't have just a single server that can make that decision. In a centralized setting you just have one server: it decides whether the edit to the document came first or the revocation came first, and that one server makes that decision. But if you have multiple servers, they might make different decisions. So then you could have a consensus protocol, but consensus is messy because it requires quorum votes and requires nodes to be online, and so we've been trying to do the whole thing without doing consensus.
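The divergence described here can be reproduced in a few lines. The following is a minimal sketch, not how Automerge or any real system handles access control: the `Edit`, `Revoke`, and `Replica` names are hypothetical, and each replica simply applies operations in the order they arrive.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edit:
    user: str
    text: str


@dataclass(frozen=True)
class Revoke:
    user: str


@dataclass
class Replica:
    """Hypothetical device that applies operations in arrival order."""
    editors: set = field(default_factory=set)
    document: list = field(default_factory=list)

    def apply(self, op):
        if isinstance(op, Revoke):
            self.editors.discard(op.user)
        elif isinstance(op, Edit) and op.user in self.editors:
            self.document.append(op.text)
        # An edit from a user without permission is silently dropped.


# Bob edits the document concurrently with his permission being revoked.
edit = Edit("bob", "hello")
revoke = Revoke("bob")

a = Replica(editors={"alice", "bob"})
b = Replica(editors={"alice", "bob"})

for op in (edit, revoke):   # device A sees the edit first
    a.apply(op)
for op in (revoke, edit):   # device B sees the revocation first
    b.apply(op)

print(a.document)  # ['hello']
print(b.document)  # []
```

Both replicas agree that Bob is no longer an editor, yet their documents differ, and no further message from either side repairs that on its own. A single server avoids this only because it imposes one order on the two operations.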

**** · But doing that while preserving high availability, while preserving the ability for users to work offline, and preserving the ability to synchronize peer-to-peer without any servers, for example, just makes the engineering challenge a lot harder. It's solvable, and we are close to solving it for Automerge, which is the CRDT library that I work on, but it's just much less straightforward than it is in the centralized case.

**** · But that's a nice example of where interesting engineering challenges arise from this desire to get away from centralized services.

**** · And then we were just talking about clocks earlier. An obvious thing that came to mind is, well, if all of them had the same clock, exactly to the microsecond, you could just use a clock, you could use a timestamp. But as you said, in distributed systems we cannot always trust that the clocks are synchronized. So I assume a lot of the things that you have been researching and writing about are just coming back to...

**** · Absolutely. And in this particular setting of a user getting their edit permissions revoked, if a revoked user still wants to, say, vandalize a document, they can just backdate their edit and give it an earlier timestamp. So relying on clocks is absolutely useless here because people can forge the timestamps from those clocks and thereby potentially undermine the access control mechanism. So in this system, we have to worry about potentially maliciously generated actions as well when the actions come from end-user devices.

**** · This is fascinating because it feels to me that you're solving a hard, or maybe even harder, engineering challenge than some startups would, because the startups would go the easy route. They would take on a constraint, in this case a centralized server, which makes business sense, makes revenue sense. But because you are not doing this, you now need to look for a solution to a harder problem. And if you solve this harder problem, you can give a building block that can just move the industry forward.

**** · It just gives a business, an individual, or an institution the option not only to use a centralized approach but to use this decentralized, local-first approach, and then of course to reason about the trade-offs and decide whichever makes sense.
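To make the earlier point about forged timestamps concrete, here is a minimal sketch of why a timestamp comparison cannot settle the edit-versus-revocation question. The rule `naive_accepts` and the `claimed_time` field are hypothetical and exist only for illustration; the key assumption is that the timestamp is written by the editing user's own device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    user: str
    text: str
    claimed_time: float  # set by the author's own device, so it can be forged


def naive_accepts(edit, revoked_at):
    """Hypothetical rule: accept an edit if its claimed time precedes the revocation."""
    cutoff = revoked_at.get(edit.user)
    return cutoff is None or edit.claimed_time < cutoff


# Bob's permission was revoked at time 100.0.
revoked_at = {"bob": 100.0}

# Both edits are really made after the revocation.
honest = Edit("bob", "late edit", claimed_time=105.0)
forged = Edit("bob", "vandalism", claimed_time=99.0)  # backdated by the author

print(naive_accepts(honest, revoked_at))  # False
print(naive_accepts(forged, revoked_at))  # True
```

The rule rejects the honest late edit but accepts the backdated one, so the check only constrains users who report their clocks truthfully.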

**** · Exactly.

**** · And that's what I mean with this long-term thinking. This is an example of it, where because it's research we can afford to take this idealistic, principled stance and say: yes, we're going to solve this harder engineering problem because we think decentralization is a valuable feature. We know perfectly well that most startups are not going to solve this problem because they will just do the easy, pragmatic thing, which is the thing for startups to do. But we have a different set of incentives and we can afford to put in the time to try and solve those hard problems. And as you said, if we can solve them, then it creates more optionality for any users of this technology: they can, if they want to, choose to use this decentralized tech. There are still trade-offs around it, but at least if they're not having to invent it from scratch, it'll be a lot easier to adopt this decentralized tech for those who want to use it.

**** · So inside academia you're also teaching. What courses do you teach?

**** · At the moment I have a concurrent and distributed systems course for the undergraduates and a cryptographic protocol engineering course for the master's students. And then additionally this year I have a seminar course on security, and I'm also teaching the undergraduate operating systems course. I've got quite a lot of teaching this year. The distributed systems course is available on YouTube.

Computer science education

**** · Can you summarize it for people who would go through this course, which again is freely available? Thank you to you and the university for making it available. What would they learn throughout the course?

**** · Yes. So that distributed systems course is a bit more theoretical than what is in the book. It's more focused on algorithms and how we convince ourselves that the algorithms behave correctly under the assumptions of distributed systems that we talked about: nodes may crash, communication might be unreliable, clocks might be wrong, etc. So that's really it. It's not a very long course. It's just eight lectures' worth of material.

**** · But it goes into substantially more detail on the algorithms than the book. So for example, one of the lectures goes through the entire Raft consensus algorithm, which is pretty complex. But I really wanted to show the students exactly how it works because it's just such a nice illustration of the challenges of distributed systems and the various measures we need to take in order to handle the various types of edge cases and failures that can happen, and showing that those problems can be overcome. It's not easy, the algorithms are very subtle, and it's very easy to have bugs in them, but it is possible to solve consensus in a way that works pretty well. And so that's really the message I'm trying to get across with this course.

**** · And you mentioned that when you were writing the book together with Chris, he brought a lot of industry insight and being up to date, and you brought your experience of teaching and what works.

**** · I don't think I have a particularly unique teaching style. In lectures I will just go through slides. I like to annotate the slides by hand during the lectures. I just draw on an iPad to make it a little bit more interactive. But other than that, it is fairly theoretical. That's partly the way the Cambridge system works. It favors theoretical, pen-and-paper courses over, say, practical implementation courses. I think it would certainly be possible to do a practical course on this, and I may incorporate a bit more practical exercise in the future, but for now it's mostly a theoretical pen-and-paper course, and that is fine. The cryptography course that I do is much more hands-on. That's about getting the students to implement some elliptic curves from scratch, for example.

**** · And how have you seen it in your time in academia, which is now a longer time period? How have you seen computer science education changing?
How do you think it might change further in the future, especially as we're seeing AI become part of industry and probably the world as well?

**** · Yeah, prior to the AI explosion happening, the rate of change was very slow in computer science teaching. Partly that might be Cambridge. Cambridge is over 800 years old, so everyone thinks on longer time scales. People don't tend to rush into the latest fad and instead try to focus on the fundamentals. A lot of the fundamentals of computer science were developed in the 1930s already and are still true today, lambda calculus and those types of things, for example, and so we have quite a bit of a focus on those fundamentals rather than chasing the latest fashionable thing. That said, AI has totally changed the way we can assess coursework, for example, because of course we can try banning AI, but it's impossible to enforce such a ban. It's also counterproductive because we do want students to engage with new technologies and figure out how to use them productively for themselves. But we want to somehow do that in a way that supports their own learning and doesn't undermine it. So how do we get the students to use AI in a responsible way, in a way that's mature? We can't necessarily rely on the students being mature enough to know for themselves what is a helpful use of AI and what is a use of AI that undermines their own learning, because some of them are quite mature and able to decide that for themselves, but many are not, and so we need to provide some guardrails for them. And we do need to make sure that when we have assessed work, for example, it's fair and it's perceived as fair by the students. If the students feel that some of their fellow students are getting really good marks without doing any work, that undermines the trust in the entire system. So we have to be very careful with how we approach this, and to be honest we don't really have good answers yet. We do now, for example, have a boot camp at the start of the first year for the new students to expose them to basic software engineering skills: this is version control, this is unit testing, this is generative AI, the basics that really everyone should be familiar with. The hope is that they will use that throughout their degree in order to just improve the work that they do. But how exactly we handle things for assessment, for example, we're still in the process of figuring out.
**** · So it sounds like the pace of change is going to be fast in the industry, and academia will probably adopt it too, and we'll see what comes after.

**** · Yes. There's a difference though, which is in the desired outcomes. I think with industry generally the desired outcome is a working product, for example. In academia the actual artifacts that the students produce, like an essay that the students write, are not really the point. We don't ask the students to write essays because we love reading their amazing essays. We ask them to write essays because we want them to go through a thought process which helps them learn something. And it's that thought process and that learning which is really the desired outcome here. So that means that we do have to approach it a little differently, because generally in industry, if you can use AI to get a job done faster and you get to an equivalent result, do it, absolutely, because that is the desired outcome. Whereas in education we do have to think about how we ensure that the learning outcomes and the thought processes are still preserved such that the students benefit intellectually.

**** · It's very relevant, especially as Anthropic had a recent study where they looked at junior engineers. One group used AI, the other one did not, and they found, unsurprisingly from what you also explained, that the group who used AI had little to no learning, whereas the group that did not use it did learn.

**** · Yes, I saw that study as well.
I think the detailed methods of that study we might be able to quibble with a bit, but I think the general principle seems true: sometimes in order to learn something you just have to struggle with it a bit. Not struggle too much, so if people are stuck on some technicality and they can use AI to get unblocked and then be able to focus really on the main learning outcome, then I think it's good to use these types of tools. But if the point is to grapple with some difficult ideas and think them through in their own minds, then we still need to find ways to make sure the students are doing that.

**** · You work both in industry and academia.

**** · What do you think industry could learn from academia, and what can academia learn from industry?

**** · The two really could be closer together, because often they regard each other with disrespect, really. The industry people will say, "That's theoretical, that's academic, it's got nothing to do with the real world." And they're really missing a trick there, because there are a lot of interesting insights from research that are very relevant to the real world, but they're not necessarily making their way across that chasm. In the other direction, the academics will say, "Oh, this industry stuff, that's just engineering. They're not doing any interesting thinking. It's just writing routine stuff." I see it as one of my goals to try and build better respect in both directions by bringing interesting insights from research into industrial practice, but also by informing our research with the problems that arise in the real world, and that way joining those two things up a bit better.

**** · What are the current research topics that you're working on, ones that you're excited about?

**** · I have two main areas I'm working on at the moment. One is local-first software. That's this idea that we want collaborative software like Google Docs, Figma, etc., but in a way that gives better protection to users' data, that's less dependent on a single cloud provider who can lock you out of your files, and that's therefore more resilient. It gives users greater agency and greater autonomy over their own data. That's an area that I've been working on for the last 10 years or so through a mixture of open source work, algorithm development, formal verification, and so on. I'm now also trying to set up a brand new research area in a totally different topic, which is using cryptography to prove things about the physical world. I'm especially interested there in sustainability-related things.
So for example, say you want to verify that the carbon emissions involved in manufacturing a particular product were X, and you want to be sure that number is correct, because maybe you want to include emissions as part of your purchasing decision and choose the product with the lower emissions. For that to be meaningful, the emissions number has to be correct. And unfortunately at the moment the numbers are generally not correct, because the incentives are to lie and cheat and to use creative accounting techniques, all as a way of greenwashing. A related thing is happening in the EU, for example, which is bringing in new regulations on preventing deforestation of tropical rainforests. So that's, for example, coffee, cocoa, palm oil, etc. imported into the EU. The importer needs to prove exactly which plot of land it came from and then check against satellite imagery that it was not recently deforested. And so I've been looking into using cryptography as a tool for proving things about the supply chains of these physical products, but without revealing commercially sensitive information. For example, a company will not want to reveal who its suppliers were and which ingredient of its process it purchased from which supplier, because that might reveal something about the secret recipe that it uses. And so the hope here is that cryptography can allow us to prove that, for example, the accounting has been done correctly across supply chains, but without having to reveal publicly any of this sensitive data about suppliers or other customers.

Martin’s current research and advice

**** · What is your view, from your vantage point, on the impact that AI is having on academia, not just for students studying but beyond that, and also on industry, given your industry contacts?

**** · Yeah, I'm not that deeply into the AI things really. I'm seeing it more through my collaborators, who are making very good use of AI tools, for software development especially. I personally write very little code these days, and so I haven't had that much need or occasion to use AI agents myself. When writing prose, working on the book for example, I prefer to still do that the old-fashioned way and just write every word by hand. So I haven't let AI anywhere near the text of the book, for example. And I don't know if that's the right decision. It's not really a matter of principle, that I think it would be wrong to do so. It's more that for myself the process of writing is how I figure things out, and figuring things out is really my goal here. So I'm trying to figure it out in my own head, and for that I just have to write it myself. There doesn't seem to be any way around it.

**** · But using AI as a way of getting feedback on ideas, or exploring whether an idea really holds up to scrutiny, or things like that, seems a very productive use of the technology, and that applies for both industry and academia, I would say.

**** · So as closing, for a student or a young professional who is still studying and considering the route into either industry or academia, who have you seen thrive in one or the other?

**** · Yeah, my feeling is they're not really that mutually exclusive. Or rather, some of the best PhD students I've worked with, for example, have a few years of industry experience. So they might have done an undergraduate degree, maybe done a master's, then spent a few years in industry doing real software engineering, learning about the real world, and then maybe at some point got bored and thought, oh, I want to work on maybe more idealistic things, or have more freedom to choose my own research topics, and then started getting interested in doing a PhD. That I find is quite a healthy route. You do get people who go straight from their undergraduate degree and master's into doing a PhD, but sometimes those people can just lack a bit of the breadth of perspective. And so I think having seen a bit of real-world engineering is really helpful for people even if they then want to stay in research. But in the opposite direction, I think it can work very well too, because in research in academia we just get to think things through a lot more carefully than people often do in industry. Often people in industry, I feel, short-circuit their reasoning: they maybe don't quite reason something through from first principles, but just think, oh, I heard this in a conference talk.

**** · I'm just going to go with that. And so what academia can teach is this nuanced and critical thinking, to really reason through trade-offs, for example, and to really justify why something is true. And so I think it's really good if people can weave in and out of industry and academia a bit, and not regard them as two totally mutually exclusive career paths, but have a bit of switching between the two.

**** · Well, Martin, thank you very much. I expected us to talk a lot more about your book, which we did, but I have a newfound curiosity and respect for all the important and interesting academic work that you and everyone else are doing. So thank you so much for this.

**** · Thank you for the great interview. This was really interesting.

**** · I hope you enjoyed this rare conversation with Martin Kleppmann. I found it interesting to learn that the first edition of the book assumed that you have machines with local disks. But today this is not how most engineers build systems anymore.

**** · Cloud-native primitives like S3 change how you build systems, and this is why this book just needed a refresh. I also appreciated Martin's take on whether engineers still need to understand system internals when they're using managed services. If you're building business logic on top of these services, you probably don't need to know every detail, but it can become useful to be able to look deeper, especially when you need to debug your system. By the end of our conversation, I gained a lot of appreciation for the academic research that Martin is doing: the local-first software work, the access control problem in decentralized systems, using cryptography to verify supply chain emissions. A lot of these are hard engineering problems that few startups would take on. It was nice to understand how academia is in a good position to do work that has a long-term focus. Do check out the show notes below for related Pragmatic Engineer deep dives.

**** · If you've enjoyed this podcast, please do subscribe on your favorite podcast platform and on YouTube. A special thank you if you also leave a rating on the show. Thanks and see you in the next one.