No Priors: AI, Machine Learning, Tech, & Startups - No Priors Ep. 143 | With ElevenLabs Co-Founder Mati Staniszewski

0:05 · Hi listeners, welcome back to No Priors.

0:08 · Today I'm here with Mati Staniszewski, the co-founder and CEO of ElevenLabs, which was founded to change the way we interact with each other and with computers through voice. Over three short years, they've skyrocketed to more than 300 million in run rate. Mati and I talk about the future of voice in education, customer experience, and other applications, as well as how to build a multi-segment company from self-serve to enterprise, and how to combine research and product.

0:35 · Welcome, Mati.

0:37 · Thanks for having me, and thank you for doing this at 7 in the morning.

0:40 · Our pleasure. Thank you for doing that at 7:00 in the morning. It's great we finally got to do this together. I think a lot of our listeners will have used or played with ElevenLabs at some point, but for everybody else, can you just reintroduce the company?

ElevenLabs: Growth and Scale

0:52 · Definitely. At ElevenLabs, we are solving how humans and technology interact, and how you can create seamlessly with that technology. What this means in practice is we build foundational audio models: models that help you create speech that sounds human, understand speech in a much better way, or orchestrate all those components to make it interactive. Then we build products on top of those foundational models. We have our creative product, which is a platform for helping you with narrations for audiobooks, with voiceovers for ads or movies, or dubbing those movies into other languages. We also have our agent platform product, which is effectively an offering to help you elevate customer experience, build an agent for personal AI, education, new ways of immersive media, all under the light of that mission of solving how we can interact with technology on our terms in a better way.

You started the company in 2022, that's right, and you've had amazing rocket ship growth since then. I'm sure it's felt up and down in different ways. I want to ask you about that. Can you give a sense of what the scale of the company is today?

1:57 · So, we've grown to 350 people globally.

2:01 · We started from Europe. We started as a remote company and are still remote-first, but have hubs around the world, with London being the biggest, New York being the second biggest, then Warsaw, San Francisco, and now Tokyo and one in Brazil. We are at 300 million in ARR, which is roughly 50/50: self-serve, so a lot of subscriptions and creators using our creative platform, and then approaching 50% on the enterprise side using our agents platform for work. That's the classic sales-led side. We serve more than 5 million monthly actives on the creative side of the work. Then on the enterprise side, we have a few thousand customers, from Fortune 500s to some of the fastest-growing AI startups.

Voice Technology and Applications

2:46 · I think this is such an interesting company because it is unintuitive to many people and investors in particular. I don't know if you faced this at the beginning, but we were both there in 2022. There's a class of companies that allow creation in some way. When we look at your first business beyond the research itself, I would put ElevenLabs, Midjourney, Suno, and Hugging Face in this category. I think there's this overall sense of, "Who really wants to do this?" What was your initial read on how many people want to make voices, or what made you believe that was going to be much broader than, for example, dubbing, which is not a huge market?

3:32 · First, as you mentioned, it's very tricky to do both the product and the research. I'm in a lucky position that my co-founder and I have known each other for 15 years. I think he's the smartest person I know and has been able to create a lot of that research work to create the foundation to then elevate that experience. But both of us are from Poland originally, and the original belief came from Poland. It's a peculiar thing, but if you watch a movie in Polish, a foreign movie in Polish, all the voices, whether male or female, are narrated with one single character. So, you have a flat delivery for everything in a movie.

4:09 · A terrible experience.

4:10 · It is a terrible experience. As soon as you learn English, you switch out and don't want to watch content this way. It's crazy that it still happens today for the majority of content. Combining that with the fact that I worked at Palantir and my co-founder worked at Google, we knew that would change in the future and that all information would be available globally. As we started digging further, we realized it would be available in every language in high quality. That was the starting point. The big thing was, instead of just having it translated, could you have the original voice, original emotions, original intonation carried across?

4:52 · So, imagine having this podcast, but people could switch it over to Spanish and still hear Sarah, still hear Mati, with the same voice, the same delivery. This is exactly what we did with Lex when he interviewed Narendra Modi, and you could immerse yourself in that story a lot better.

5:09 · So that was the original insight. We then started digging further, realizing that so much of the technology we interact with will change. It's still relatively tricky to bring voice alive. You need to go through the expensive process of hiring a voice talent, having a studio space, having expensive tooling to adjust it. The tooling isn't intuitive to be able to do this. So, all that creation process will and should change to make it easier for new people with keenness to bring that to life. Then, a lot of the technology wasn't possible for you to recreate a specific voice or create it in that high-quality way.

5:53 · Of course, as we dived further and shifted away from the static piece, the whole interactive piece is still crazy in the way it functions. Most of us have seen technological evolution over decades, but you still spend most of your time on the keyboard. You look at the screen, and that interface feels broken. It should be where you can communicate with devices through speech, through the most natural interface there is, one that started when humanity started. We realized we want to solve that. Fast forward from 2022, I feel like many people will carry that belief too: voice is the interface of the future. As you think about the devices around us, whether smartphones, computers, or robots, speech will be one of the key ones. But in 2022, it wasn't, and as you think about the market for the creative side or the interactive side, it was clear it would be a huge one.

Research and Product Development

6:52 · So, even when you think about just the research part of your business, and then you have products for at least two different markets, and then you have this larger mission. A lot has changed in the last 5 or 10 years, but it used to be a strongly held traditional belief that one must do one thing well in a startup, and there's no other path. You're treating this like an interaction company, a platform company. How did you think about sequencing the research and product effort? Does that make sense? Or thinking about new markets? And maybe wrapped up in that question too is, where are we in quality on voice? Because if the models are not good enough for certain use cases, it doesn't make sense to do product.

I think that's right. It's almost exactly like when we started originally. What we did was try to use existing models that were in the market and optimize them. Our first use case was actually a combination of narration and dubbing on the creative side. We realized pretty quickly that the existing models produced speech so robotic and poor that people didn't want to listen to it. That's where my co-founder's genius came in: he was able to assemble the team and do a lot of the research himself to create a new approach to that work. But to your question, the way we are organized internally and how we think about sequencing was to look at the first problem and then create effectively a lab around that problem, which is a combination of researchers, engineers, and operators going after it. The first problem was the problem of voice. How can we recreate the voice? And like you say, it needs research expertise to do that well. So we started with effectively a voice lab, which had that mission: can we narrate work in a better way? It was a combination of roughly five people doing that work.
Then we sequenced the research first, then built a simple layer on top of that work to allow people to use it, and then expanded from there with a holistic suite for creating a full audiobook and then creating a full movie narration or movie dub. Then we moved to the next problem: the realization that okay, we have solved the voice, great for making content sound human. The first problem for that to be useful for us to interact with technology is solving how you bring knowledge on demand into that. So we effectively started the second team, which was a second lab, an agent lab, effectively, which was a team that would combine researchers, engineers, and operators once more, which would try to fix: okay, we have text-to-speech, how do they now combine this with LLMs and speech-to-text and orchestrate all those components together while integrating that with other systems to make it easier? Then similarly, you expand from looking just at the voice layer into how those systems work together. Here too, you need research expertise to do that in a low-latency, efficient, accurate way. But at the same time, there's that product layer that starts forming. It's not only the orchestration that matters; it's also the integrations, how you link up to legacy systems, how you build functions around it, or how you deploy that in production and test, monitor, and evaluate over time.
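The orchestration described above (speech-to-text, then an LLM, then text-to-speech, with monitoring around it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ElevenLabs' implementation: every name here (`CascadedAgent`, `fake_speech_to_text`, `fake_llm`, `fake_text_to_speech`) is hypothetical, and the three stages are stubs standing in for real models.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical stand-ins for real models; each stage is just a function here.
def fake_speech_to_text(audio: bytes) -> str:
    # Pretend the audio bytes already contain the spoken words.
    return audio.decode("utf-8")


def fake_llm(prompt: str, history: List[str]) -> str:
    # A real system would call a language model with the conversation history.
    return f"You said: {prompt}"


def fake_text_to_speech(text: str) -> bytes:
    # Pretend the returned bytes are synthesized audio.
    return text.encode("utf-8")


@dataclass
class CascadedAgent:
    """Chains the three stages and records per-stage latency for monitoring."""

    stt: Callable[[bytes], str]
    llm: Callable[[str, List[str]], str]
    tts: Callable[[str], bytes]
    history: List[str] = field(default_factory=list)
    timings: Dict[str, float] = field(default_factory=dict)

    def _timed(self, name: str, fn: Callable, *args):
        # Measure wall-clock time for one stage.
        start = time.perf_counter()
        result = fn(*args)
        self.timings[name] = time.perf_counter() - start
        return result

    def respond(self, audio_in: bytes) -> bytes:
        text_in = self._timed("stt", self.stt, audio_in)
        reply = self._timed("llm", self.llm, text_in, list(self.history))
        audio_out = self._timed("tts", self.tts, reply)
        # Keep the turn so later replies can use the context.
        self.history.extend([text_in, reply])
        return audio_out


agent = CascadedAgent(fake_speech_to_text, fake_llm, fake_text_to_speech)
out = agent.respond(b"where is my package")
```

Because each stage is a separate callable, one component can be swapped or measured without touching the others, which is the practical appeal of a cascaded design; the cost is that the per-stage latencies add up.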

10:11 · Do you feel like you were creating new use cases when you built the tools? Do people know that they wanted to do this already? Because one argument I remember hearing was, "Enterprises don't know what to do with voice. How many people really want to do it?" And then you're serving essentially the creator or publisher side of your business.

10:30 · Yeah, it's definitely a combination of initiatives that we believe will happen in the world and then a response to a lot of that. As I think back, of course, voice, the internal voice lab, or agents lab, then kind of kickstarted so many of the other labs in response to the problems. We started a music lab because people wanted to create music with ElevenLabs. It was a fully licensed model where people wanted to use and create speech but wanted to add music in a simple way. We wanted to deliver that, and then of course, that kind of came together through how do we combine music, audio, sounds. We are now integrating partner models from image and video into that suite. How could you combine all of that in one? A lot of that was in response to the market saying, "Hey, we would love this." Then you have completely different use cases even in that space. Let's say dubbing.

11:20 · Dubbing is a use case that we didn't feel there was a big push for, but we knew that in the ideal world in the future, you would be able to have that content delivered naturally around the languages, still carrying that. I still think actually this market will be immense because it's not going to be only the static delivery in movies, but if you travel around the world and want to communicate in real-time, like the full Babel Fish idea from Hitchhiker's Guide to the Galaxy, this will happen. It will be the biggest breaking down of language barriers, barriers to communication, to creation. All of that will break, and that will be the foundational real-time dubbing concept. So, I'm super excited about that part. Similarly, on the agent side, you have some obvious things that customers or partners will want to integrate, like integrations with XYZ systems. But then there are other parts that might not be as easy to predict. As you interact with technology, you want to understand what's happening, but you also want to understand how things are being said and bring that into the fold. That would be something we try to prioritize on our side. So then people, when they actually interact with the technology, realize, "Oh, expressing a thing is actually so much more enjoyable and beneficial and helpful." So, I want to ask a question about this, which relates to quality. I work with a series of companies where we're selling a product to buyers who are generally not machine learning scientists. Even the scientific community does not have the full suite of evaluation benchmarks to understand every domain well. That's a well-known problem, but I imagine for a lot of your customers, it's not like they know how to choose a good voice. So, how do you deal with that problem? Is it like, "Hey, I make a clone, and it sounds like me, and I believe it. I'm going to try all of these different options"? Or are you teaching people to evaluate?

Voice Quality and Customer Preferences

12:49 · It's a great question because I think there are two big problems. One is how do you benchmark the general space in audio, where, as you say, it's so dependent on the specific voice, let alone if you are training into interactive, then it's even trickier. The second piece is, as you are working on a specific use case, how you select a voice. I'll take the second front first. We have a voice curator. As we work with enterprises, we deploy that person to work with them and help them navigate. That person is like a voice coach and has an incredible voice themselves. Now we have a team under that person that will partner to help you find what's the right branding. And now we have a celebrity marketplace to help you even get iconic talent in there, like Sir Michael Palin. That piece was important because, of course, the voice will depend on the use case you are trying to build, the language, all of that will have an impact on what's the right voice for your customer base. So, we have a voice person helping those companies. Some companies will be very opinionated on what they want.

14:23 · So, they will sometimes select it themselves, sometimes give us a brief of, "Hey, we want a voice that sounds professional, neutral, commanding." We recently had a company, one of the biggest European companies, that wanted a very original brief: they wanted as robotic a voice as possible.

14:42 · Okay.

14:42 · It was counterintuitive.

14:45 · But for you, we can't do that anymore. But we were trying to go backwards, how do we do that? I think we got a good result. Recently, we had a company in Japan and Korea where they wanted to serve different voices depending on the customer that's calling in. They have an older population and a much younger population. For the younger one, they wanted one of the famous voices in the market that's very excitable and happy.

15:12 · For the older one, they wanted a calm, slow-speaking one. We help a lot with that. So, that's on the voice piece. I do think it's going to be a big, important, personalized choice, and then it can even be dynamic in a customer.

15:23 · Yes.

15:23 · Exactly. Exactly. And then maybe in the future, it's going to fully depend on your interaction. You will have a voice created as we understand the preferences of what people want. So, let's say it's the evening and you are tired, and you want a slightly different voice, or maybe not. Maybe that's the best focus time that you have, and you want a voice that's giving that energy. And probably it's different when you wake up and it gives you the morning news of what's happening or what the weather is. So, all of those could be different. Yesterday, we had dinner with some of our partners, and the first thing one of them said was, "Hey, I have a new request for you. I want a New York voice with a Long Island accent." I never knew that was a thing, and it supposedly is a thing. So, we have that. On the first piece, I think it's an unsolved problem still. I think you have good benchmarks, of course, in LLMs. In the image space, they are pretty good. In the voice space, you have, of course, the speech quality, but then so much of whether you like the speech depends on the voice. If you compare model A to model B and serve them different voices, even if the quality is very different, the voice itself can make that sort of difference. We've seen this. I don't know if you know Artificial Analysis benchmarks. I think they're pretty good.

16:41 · Just switching the voice makes such a big impact.

16:43 · That's so interesting. Yeah. And I wonder if, as you said, this is the most dominant interaction mode we've had for millennia of all human history, right? And so bias is serving, but I think so.

16:56 · We're just very sensitive to it. And I think people are going to be very sensitive to their own personalization as well.

17:03 · 100%. I think there's also a third piece which maybe is not directly to your note, but we've also realized that you have benchmarks, you have how do I find the right voice for my audience, but even the understanding of how you describe audio data is still lagging in the industry. When we initially started, we of course went to the traditional players for them to help us label not only what was said, so transcription, but also how it was said: what are the emotions, what accent is used. Most people just weren't able to do that work effectively because you kind of need to hear and have a little bit of a skill set of how would I describe this specific delivery. So, we needed to create that ourselves. I think there is that piece as well of how do you effectively interpret the data of audio on a more qualitative basis. That's trickier.

Can you talk about what's happening on the agent platform side? What is challenging for businesses or even creators that are trying to build agents? And what are maybe the surprising or high-traction use cases? I think everybody's kind of aware of the idea of agent-based customer support, but I imagine you're doing many things beyond that.

Agent Platform and Use Cases

18:14 · Yeah.

18:14 · So, customer support is probably the one that's kicking off the quickest, and that's the one that we see overtaking so many use cases, whether it's with Cisco or Twilio or TELUS Digital. All of them are kind of elevating that to a high extent. I think the second exciting piece within that domain which is happening is the shift from effectively reactive customer support – I have a problem, I'm reaching out to customer support – into more of a proactive part of the experience.

18:45 · To make it explicit, we work with the biggest e-commerce shop in India, Meesho, where they started working on the customer support side (I want a refund, I want to see the tracking of the package) and moved to actually having an agent be a front part of the experience. So, if you go to the website, you have the widget, you can engage it through voice, and you can ask it, "Hey, can you help me navigate to item X, item Y, or can you explain what's the right thing for me to give as a gift for this period of time?" And then it will actually help you based on your questions, based on what's on offer, show you those items, navigate to the right parts of the piece, maybe go all the way through the checkout. I think this will be a phenomenal thing of elevating the full experience, where that's more of an assistant across the whole thing. We kicked off our work with Square, enabling businesses to do that work. Exactly the same pattern started with voice ordering. How can this now be part of the full discovery experience too, where you get items shown to you? You can have a lot more explanation, which I think will be a phenomenal piece where, effectively, it helps from the beginning to the end. So, that's one category. The second one is the wider shift from static to immersive media, where there are just so many incredible stories and IP that today exist in effectively one way of delivery, and now you'll be able to interact with that content in a completely new way. I think one of the incredible use cases was working with Epic Games. We worked with them on bringing the voice of Darth Vader into Fortnite, where millions of players could interact with Darth Vader live in the game, where you had a full experience of Darth Vader in a new way. I think this will be a theme across whether it's talking to a book, talking to the character that you like, the whole space shifting.
Then, I think the one that I'm most excited about for the world and for the shift is going to be education, where you will just be able to have effectively a personal tutor on your headphone, and you can actually study something in an amazing way. I'll give you two quick examples. One is we recently worked with Chess.com. I'm a huge fan of chess.

21:03 · Okay, great. So, you can learn chess, but you can have Hikaru Nakamura or Magnus Carlsen be your teacher of how you deliver that, which is amazing. Or even the Botez sisters. It's a whole plethora of different players that engaged with that, which I think is great. Then, maybe a last one, which is MasterClass, who we worked with to shift from content you go through step by step to an experience that can also be interactive. The best example of that was working with Chris Voss, the FBI negotiator, one of the top negotiators, who has a MasterClass lesson, but then you can actually call him and have a practice negotiation, which is crazy.

21:42 · Yeah, got to get that hostage out. We'll definitely try it.

21:45 · Yeah.

21:45 · Can I add one more? I think the last one, which combines all of them together and which I realized just recently, was crazy. Recently, I went to Ukraine, where we are working with the Ministry of Digital Transformation, where they are effectively creating the first agentic government.

22:03 · And the crazy thing is they have all of those government agentic governments. So they want to change how they run all the ministries.

22:12 · Okay.

22:12 · And it sounds like a big, ambitious goal and lofty.

22:17 · No, I think the baseline is here. So, actually, I'm by that immediately.

22:20 · Yeah. And the crazy thing is, I think they are so ahead in actually doing that. And I think there are two concrete things there. One, they kind of combine all those use cases. So, we are looking into how they can have effectively customer support of government, whether it's asking about benefits or employment, about the process of how you leave the country. All of that be run through effectively a digital app. Then, two, how you can have a proactive way of informing citizens of things that might be happening. But then having an education system that also runs through like this personal tutoring experience. All of that is happening. So, that was incredible to see. The second amazing thing was the way they've done it.

23:00 · So, they have the digital transformation piece, but they have engineering leaders in each of the ministries that lead those efforts and then bring them back to that one central piece. So, that is incredible to see and also proud to be able to be working with them on that shift, that despite everything that's happening, they're so... That's amazing. That's really encouraging. Can I ask you a business model question here because looking at the strategic landscape? Actually, I have many questions here. One of the observations I'd have is, if I look at one of these rich voice and action agent experiences, there are a lot of, let's say, Fortune 500, Global 2000 leaders who listen to the pod. I think a lot of them are going to buy the idea of, "I want this amazing, automatic, real-time, available 24/7, every language experience for my customer that's consistent and high quality." The ways I might get there include working with Palantir or a large consulting firm, working with 11 Labs or a platform technology company, or like an OpenAI, right? Let's talk about that. Or working with a more use-case-oriented company like Sierra, right?

Choosing the Right Technology Partner

24:20 · How do you think about how people are making that decision, or how they should make that decision?

My past is also in Palantir, so I started exactly from that side. We do blend a lot of the forward-deployed engineering inside of the company too. As I think about our offering and customers making that choice, if you're looking just for a single point solution and only that one, then likely we aren't the best choice. If you are looking to deploy that across a plethora of different experiences, so be it customer support, but then you also want internal training, then you might want to elevate your sales part and actually increase the top line with new experiences of how you engage customers beyond that reactive piece.

25:01 · Then it's a great platform to build, and then we effectively, as we engage with customers, combine that platform work with our engineering resources to help those companies deploy on that. Or, which we also see increasingly in Fortune 500s, G2000s, where they will want to build parts of the things themselves because they already have a lot of investments in that platform, while then engage us on some of the new ones and combine those. I think that our model and the way it's different to a lot of the use-case-specific ones is that our platform is relatively open, where you can use pieces of that platform and not all of them for those different use cases.

25:41 · Palantir, of course, will have a lot more resources to go in the wider digital transformation journey. In our case, it's very specifically conversational agents. If you are looking for a new interface with customers, that's the best way. And companies like Sierra are phenomenal, of course, in how they are thinking about the specific pointed use case. Maybe the other piece, as we think about our work, is that it depends on what you are optimizing for. So, we have a lot of international partners. If you have a wider geographic user base, great.

26:19 · That's what we optimize for. Our voices, our languages, our support for integrations internationally are just so much broader. There's frequently a piece that you will look into depending on your exact scope. This will be a big factor. But I would summarize that if you're looking for a solution across a set of different use cases, that you want our engineering help to deploy that, then we are the right solution and probably the best solution.

The Role of Foundation Models

26:43 · I want to talk a little bit about maybe OpenAI and the LLM foundation model companies. One of the reasons I called this podcast No Priors is because we're like, "Okay, people are making a lot of assumptions all the time about how the market is going to work." Lo and behold, many of those assumptions end up being nonsense. You have to very much decide your own narrative at this point in time. I think, correct me if I'm wrong, in 2022 and 23, you probably heard a lot of people say, "Google can do this, and OpenAI can do this. Why do you get to persist working on voice anyway as a general capability?" What's the answer?

That also adds another element to the couple of other previous questions, where whether it's agents' work or the creative work, to deploy the value in those, you need a very strong product layer, you need the integrations, you need to help people deploy the work, which is the most common piece. But our superpower and our focus for a long time was building the foundational models to actually make that experience seamless. As I think about a lot of the companies in the market, they will optimize for a lot of other things, and that will be the differentiator. In our case, we will make the whole experience, especially with voice, seamless, human, and controllable in a much better way.

So, fundamentally, you would argue that the labs just aren't going to focus on this and haven't.

Exactly. So, I think most of those companies, and that's the thing about the long term, it's going to be incredible research and incredible product that meets customers where they are and works backward from there. I don't think the labs will focus on building that product layer that's so important.

28:26 · But I think the part of the question that you're asking is how or why they haven't done even the research part to the quality that we've been able to. Here, I'm also biased, but we are happily beating them on benchmarks with text-to-speech or speech-to-text or the orchestration mechanisms. Here's credit to my co-founder and the team that they've been able to do it. It's just our researchers continuing their work. But I think the main part that is different in the audio space is that you don't need the scale as much as you need the architectural breakthroughs, the model breakthroughs, to really make a dent. And we've been able to do that a couple of times. I think the number of people doesn't matter, but which people you have does. We think there are maybe 50 to 100 researchers in the audio space that could do it. We think we have probably 10 of them in the company that are some of the best ones.

29:23 · And I think this obsession of just those people working across and then actually giving the full focus of the company on making them actually work on that and bringing their work to production, seeing how the users interact back, was so important. So, that's how we've been able to create models better than some of the top companies out there. But, you know, the truth is, to a large extent, why they weren't able to do it is also interesting. We don't know. It's a... they have such incredible talent there too.

Open Source Models and Future Trends

29:58 · How do you think at the same time about open-source models?

Anyone you ask in the company, I think, will give that same narrative. We think that in the long term, models will commoditize, or the differences between them will be negligible. They will still matter for some use cases; for most use cases, they won't, and they'll be broadly available. And we don't know when that is, whether it's two years, three years, four years, but it's going to happen at some stage. Then, of course, you will have a fine-tuning layer that will matter a lot on top of those models, but the base models, I think, will get pretty good. And that's why for us, the product piece is so important from the company perspective, but also from the value perspective. Because if you have a model that's great, actually connecting your business logic and knowledge, or having the right interface for creating an ad for your work or a completely new material, is still a very different exercise. On open-source models, if I split it into two, the first is more of that async content narration. I think for narration, open-source is great, commercial models are great, and the differences are getting smaller on the out-of-the-box quality. What most of the models haven't figured out, and I think we have, is how to make them controllable.

31:17 · So, that's the narration piece. I think the whole interaction piece of how you orchestrate the components together, whether that's a cascaded speech-to-text and LLM text-to-speech approach, or whether in the future it's a fused approach where you train them together. I think this is good for customer support or customer experience, but it's still away from conversation like we have and passing the Turing test. So, I think this is still at least a year away. Then you'll have real-time dubbing, kind of a variation of real-time translation conversation, and I think that's maybe more two years away. You know, a very uncomfortable belief that I feel comfortable having, but I think is uncommon in the market right now, is that actually most advantages in technology, they could last you a year or they could last you 10, but they're not infinitely defensible. If you think about that from a model quality perspective or a product perspective, they allow you to serve the customer better and build momentum and build scale for some period of time. And actually, that's really powerful over time, right? But it's not a clean forever answer. And so, I think that makes business people and investors uncomfortable.

32:32 · And it's very true as well. The way we think about it, research is a head start. This gives us an advantage. We can give advantage to the customer earlier, and it's a six to 12-month advantage. That is also a way for us to build the right product layer for you to get the best of that research. Frequently, we do that in parallel. So, the moment the research is out there, you have the product because we know our initiatives, we know what the product is. So, you have research and product in parallel that extends that. But the thing that will really give that long-term value is the ecosystem that you create around it, whether that's the brand and distribution, whether that's the collection of voices you can have, the collection of integrations you can build, the workflows that you can build. I think that's the way we kind of sequence that in our mind: research, product, ecosystem that we build. And research is just a head start and being able to accelerate the future a little bit closer.

I think that's a really powerful insight, especially if you know the research team and the company team believe that as well internally.

I think the piece that's interesting for us, and I think this is the big question for all companies that do research and product, is do you wait for research, or do you do a product change? Or even not only research-and-product companies, do you wait for someone else to do the research, because the timeline for that isn't clear? Is it three months, six months, 12 months? You don't know exactly what it will do. Which is the hard choice: do I invest into the product layer, or do I just wait more for the research? In our case, we internally let all the product teams lead the research initiatives so we can parallelize that work. But we don't hold them back. If a product team thinks we should deliver value to the customer by doing something different, they can. A rough rule of thumb is three months. If we think the research is going to take longer than three months, we will probably build it in the product. If it's less than that, we probably won't.

Can you talk about some of the research that you're doing now and how you think about the cadence of delivery and what's worth working on?

Research and Development Focus

34:31 · We have now a number of different initiatives across the audio space, and there are two big buckets, and roughly they will relate to the creative and agent side.

34:40 · On the creative side, what this means is text-to-speech models that are controllable. We then added a speech-to-text model that transcribes in a high-accuracy way, but across low-resource languages as well, covering almost 100 languages. Then we created a music model, a fully licensed music model. As you think about the future, it's how those models will also interact with some of the visual space.

35:02 · So, that's a lot of effort, and how you can get the best of audio and then potentially combine that with existing video that you have to really have the best delivery. Then on the agent side, it's of course how you optimize real-time speech-to-text, real-time text-to-speech. We just released our speech-to-text model, Scribe v2, which has latency under 150 milliseconds and 93.5% accuracy across the top 30 languages on FLEURS, and it's only the top 30 here because we serve so many others, but most people don't. So, it's beating all the models on benchmarks. As you think about the future, it's also the orchestration piece of how you bring speech-to-text, LLM, and text-to-speech together. We will be releasing, over the next couple of months, a new orchestration mechanism that will lower the end-to-end latency in a great way. But the second thing, which is so hard, is it's not going to only allow you to combine those pieces, but also add the emotional context of the conversation, so you can actually respond with the model and be more expressive in a better way. In the future, and something we're investing in, is parallelizing speech-to-speech with a more fused approach as well. Of course, it depends on the use case. If you are in an enterprise use case that needs reliability, the cascaded approach is the approach for the next year too. It has more structure, more visibility into each of the steps, it's reliable, you can call tools. If you want something more expressive and can tolerate hallucination, speech-to-speech might be the choice, and maybe over time you'll see them kind of go one over another depending on the industry. But that's a huge investment on our side. The foundation of the whole platform, and the main part that we are continually investing in, is a plethora of different models that combine the best of audio with some of the best of the other modalities together.
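
Editor's sketch: the cascaded approach described above (speech-to-text, then an LLM, then text-to-speech, with visibility into each step) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not ElevenLabs' implementation. The class `CascadedAgent`, the `trace` list, and the three `fake_*` stages are hypothetical stand-ins invented for illustration; no real speech or LLM API is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CascadedAgent:
    """Hypothetical cascaded voice agent: three separate stages run in sequence."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    respond: Callable[[str], str]        # LLM stage (tool calls would live here)
    synthesize: Callable[[str], bytes]   # text-to-speech stage
    trace: List[Tuple[str, str]] = field(default_factory=list)

    def handle_turn(self, audio_in: bytes) -> bytes:
        # Each stage's text output is recorded, which is the
        # "visibility into each of the steps" a cascade offers.
        text_in = self.transcribe(audio_in)
        self.trace.append(("speech_to_text", text_in))

        reply = self.respond(text_in)
        self.trace.append(("llm", reply))

        audio_out = self.synthesize(reply)
        self.trace.append(("text_to_speech", reply))
        return audio_out


# Stub stages (assumptions): bytes stand in for audio so the sketch runs offline.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


agent = CascadedAgent(fake_transcribe, fake_respond, fake_synthesize)
audio_reply = agent.handle_turn(b"where is my order")
for stage, payload in agent.trace:
    print(stage, "->", payload)
```

A fused speech-to-speech model would replace the three stages with a single call, which is why the intermediate text, and with it the easy inspection point, would no longer be available by default.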

36:52 · I want to take our last few minutes and ask you a few questions about just the future that I think you'll have a really good point of view on, given you think about voice and audio all the time. What do you think of AI companions?

Future of AI Companions and Education

37:04 · I think they will be a big thing and exist in a big way. It's not something I'm personally excited about or something that we spend much time on, but I think the whole line of what's an assistant, companion, character that you enjoy as part of an experience will kind of blur and blend to a large extent. They can be very common.

But you're not enthusiastic personally about it?

I'm more excited about like more of a Jarvis version of that, or like, I have a super assistant superpower, versus the social version. I think it would be such an incredible unlock, and it also blends into that personal context. I would love to start the day and have someone that understands me and tell me what's relevant to me, open the blinds, tell me about the weather and the sunshine, and play music straight away.

38:01 · It's going to happen.

38:01 · It's going to happen. That I'm excited for. I think the companion use cases will solve loneliness. In that part, I think that's one way. Maybe there are different ways of engaging people back. I do think there will be an interesting future even if you think about education, where you will have superpowers with learning from AI tutors. But I think on the flip side of that, and I think this will be my personal take, you will have education, a good percent of time spent with AI tutors, but then an explicit percent of time spent without any technology, human to human.

38:36 · So you can learn that part too.

38:38 · Yeah, I think this is the correct model.

38:40 · Both in terms of emotional guidance and coaching, and you know, guardrails, as well as peer-to-peer.

Exactly.

What do you think about dictation, or what happens in terms of how we control technology that isn't necessarily personified as well? Or does it just all become personified?

39:03 · I think not all personified. I think some, you know, communicating with the oven at home probably will stay pretty static or coded. I might just... Exactly. Like, you don't probably need that much additional emotional input. But I think it's going to be a huge part where, in a way, what I hope will happen is you will have the ability to stay more immersed in real life, with the devices going back into the pocket, back into some version of an attached element, assuming that's in the right setting, and that kind of acts on your behalf. In many ways, let's say dictation, as Karpathy says, "decade of agents." Let's call it a decade. Then you'll have a decade of robots. If you are interacting with robots, of course, voice will be the input and the output as one of the key interfaces. So, you will need that dictation as a huge part. But similarly, I think the robots are going to be personified.

39:58 · Yeah, 100%. 100%. Yeah. No, I think most of the use cases will be personified.

40:05 · Okay.

40:05 · Last one. What's one thing that you've seen already exist today, or if you project out a few years, will change about how we interact with content? Maybe it's personalized voice content, or just something people are going to do with AI voice that they don't do today, or that not everybody knows about.

40:24 · I think the biggest one that hasn't yet kicked into the system is how education will be done. I think learning with AI will be with voice, whether it's on your headphones or in a speaker. It's just going to be such a big thing where you have your own teacher on demand, who understands you very personally, and kind of delivers the right content through your life. I think this will be one of the biggest use cases. I don't think it's happened yet. I think we have seen some of the commercial partners, but schools, universities, how that's deployed in a safeguarded way, in a way that supports the other part of education, the social part of education, I think all of that will evolve. Maybe there's a cool version of that where you have Richard Feynman or Albert Einstein deliver those lecture notes, or other teachers that you love. It will be sick.

41:14 · It's a great note to end on. Thanks for doing this, Mati.

41:16 · Thanks so much.

41:20 · Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way, you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.