Y Combinator - The Powerful Alternative To Fine-Tuning

Transcript

Intro

The world is changing so quickly. This is probably a little bit obvious, but you should just try things and every day do something with AI. Last summer, I took a weekend and used GPT-5 to help me build an iPhone app. I had not done that in a decade. It is so fast and so easy. That was an age ago, or about eight months ago. Now it is even faster and easier. Do not limit yourself. Anything that you imagine, you should just try to use AI and see how far you can get with it, and you will be making the world better.

What Is Poetiq?

Welcome to another episode of the Light Cone. Ian Fischer is the co-founder and co-CEO of Poetiq, which is building recursively self-improving AI reasoning harnesses for LLMs. Previously, he spent a decade as a researcher at Google DeepMind and founded a mobile devtools company through YC years ago. Welcome, Ian.

Thank you. I am so happy to be here.

What is Poetiq? How is it different from RL? How is it different from context engineering?

Recursive Self-Improvement Explained

At Poetiq, we are building a recursively self-improving system. Recursive self-improvement is the holy grail of AI where the AI is making itself smarter. The core insight that we had is that we could do recursive self-improvement far faster and cheaper than all of the other ways people had been proposing to do this. I cannot go into details about our particular approach, but most of the approaches out there involve training a new LLM from scratch. Training LLMs from scratch costs hundreds of millions of dollars and takes months of effort. Then Anthropic or OpenAI will come along and just eat your lunch in the next model release. Of course, Anthropic, OpenAI, and Google are exploring recursive self-improvement through training, but typically at the level of having to train a new model for every step of self-improvement that they do.

The Fine-Tuning Trap

That seems like the defining thing that a startup really wants. I want to take advantage of whatever the next model is, but the second you are in fine-tuning land, you are spending millions to hundreds of millions of dollars, and then guess what? You just lit it on fire because the next version of the frontier model comes out, and you will never catch up. Whereas working with your systems means that you will always have the thing that is better than the thing that is out of the box, and that is sort of the holy grail.

We think that this is incredibly valuable to anybody who is building on top of large language models. We do not view the frontier models as competitors. They are the ones we are using as stilts to stand on top of, but if we did not have that foundational layer, then Poetiq could not exist.

“Stilts” for LLMs

Being the smartest model is a game of inches, and those inches matter a lot.

How do we actually get started? You have built something that any startup could use, something that is like stilts.

We have built a system that can automatically generate systems for your particular problem that will always outperform the underlying language models without the massive expense. About the bitter lesson, what would you have done without Poetiq? You probably would have said, "Okay, we are going to first collect a large dataset, tens of thousands of examples for our particular problem that we are working on, and we are going to fine-tune the best model we can get our hands on." Maybe that is one of the frontier models, or maybe it is an open-weights model. You are going to spend a lot of money on that fine-tuning. The compute is so expensive. At the end of it, you have something that works better than the thing you fine-tuned on top of, but by then a new model has come out, and it is better than the thing you fine-tuned. You fine-tuned three years ago on top of GPT-3.5, and then GPT-4 comes out and it just blows you out of the water. Are you going to do that again, or are you going to go out of business?

With Poetiq, we give you a system, a harness, or an agentic system that sits on top of one or more language models, and it just performs better than them. When the new model comes out, that same harness is perfectly compatible with it, and you do not need to change anything to get an even bigger performance bump.
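As a rough illustration of why a harness can survive a model upgrade, here is a minimal sketch. It is not Poetiq's method, which the speaker does not disclose. It assumes a "model" is any function from prompt text to answer text, and the names `make_harness`, `old_model`, and `new_model` are hypothetical. The harness only depends on that function signature, so a newer model can be dropped in without touching the harness code.

```python
from collections import Counter
from typing import Callable

# Assumption: a "model" is any function from prompt text to answer text.
Model = Callable[[str], str]


def make_harness(model: Model, samples: int = 3) -> Model:
    """Wrap a model in a tiny harness: ask several times, return the majority answer."""

    def harness(question: str) -> str:
        prompt = f"Think step by step, then answer concisely.\nQuestion: {question}"
        answers = [model(prompt) for _ in range(samples)]
        # Majority vote over the sampled answers.
        return Counter(answers).most_common(1)[0][0]

    return harness


# Stand-in models for illustration only; a real system would call an LLM API here.
def old_model(prompt: str) -> str:
    return "4"


def new_model(prompt: str) -> str:
    return "four (4)"


harness_v1 = make_harness(old_model)
harness_v2 = make_harness(new_model)  # same harness code, newer model underneath
```

The design choice worth noticing is that nothing inside `make_harness` refers to a specific model, which is what makes the swap a one-line change.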

Recursive Self-Improvement vs. Fine-Tuning

Additionally, we can continue to optimize for this new model, whatever the new model is that you want to use, and make it even better. You do not lose out on hundreds of millions of dollars. In fact, we do this so much more cheaply than fine-tuning would cost.

Taking the Top Spot on ARC-AGI

You have done this a bunch of times. When you first came out with your paper in December of last year, you shot to the top of ARC-AGI v2, and then you did the same for other benchmarks too. What was that like?

ARC-AGI v2 was us coming out of stealth, letting people know that we could tackle these really hard problems. We wanted to show that our system could generate reasoning systems that are highly effective. Gemini 3 had just come out, and they were at the top of the leaderboard at 45%. Two days later, we released our results, showing that we could get a lot higher than that.

They come out with SOTA, and then you come in right above them every single time, which is wild to see.

That is what it is like to have stilts. Whatever model comes out, you can be taller than that one with Poetiq, which is awesome.

Yeah.

The interesting thing is that we were half the cost of Gemini 3 Deep Think because we were building on top of Gemini 3 Pro, which is a much cheaper model. We still got a nine percentage point improvement on the official verification. They were at 45%, and we were at 54%.

Beating Claude on Humanity’s Last Exam

So recently, you guys just announced some incredible results for Humanity's Last Exam. Can you tell us more about those?

Humanity's Last Exam is a set of 2,500 really hard questions written by experts in many different domains. They are meant to be challenging even for PhDs in those fields. AI has not passed it yet, but we got to 55%, which is almost two percentage points higher than the previous state-of-the-art that came out just last week from Anthropic with Claude Opus 4.6. They got 53.1%, and we got 55%.

Humanity's Last Exam does not publish the cost of getting those results. In your case, this run was done for under six figures. How much was it?

We did not publish any cost for this, but I can say that the optimization cost us less than $100,000.

Which is impressive because each of these big foundation model training runs is in the hundreds of millions of dollars. And you guys, as a company, are only seven people.

That is right. Seven research scientists and research engineers.

That is impressive. The thing that is very interesting about your approach is that it takes a very scientific approach to the emergent behaviors that a lot of the best founders are finding with models. A lot of founders who get very good results for agents treat the underlying model as a common layer that you can swap in and out. For example, very hard-to-verify bugs get sent to GPT-5.2, whereas architecture work gets sent to Claude 4.6. You are kind of doing this automatically instead of having a human conducting it, which is very impressive. I think there is something more special going on underneath. Can you tell us a bit about how it works?

Yeah, it sounds magical. So, what can you tell us?

How the Meta-System Works

You are getting at a core thing. These harnesses are code, prompts, and data built on top of one or more language models. This is something that, in principle, you can build by hand or with Claude Code. In practice, it takes a lot of work to have all the insights to make these work well. The core technology that we have developed at Poetiq is recursive self-improvement.

We have a recursively self-improving system which we call the Poetiq Meta-System. The output of that system is systems that solve hard problems. A hard problem is something that if you gave it to GPT-5.2, it would struggle to give you a reliable, robust result. This is a very big advantage for us. We can generate these systems in a much more automated manner.

This means that we can do it much more quickly and much more cheaply than if you hired a team yourself to try to make your own agent to solve your particular task. Not only that: since this is really an automated optimization process, it also works if you have already done that work. Say you are a startup going after a particular vertical, and you think you understand your problem pretty well. You have put together your agent, and it is working pretty well, but you know you can get something better, or you really need something better. You can bring that to us, and we can optimize that entire agent or pieces of that agent. We could optimize just the prompts or just the reasoning strategies.

There are a lot of different things that we can do depending on your particular needs.
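To make "automated optimization" of just the prompts concrete, here is a minimal sketch of the simplest possible version: score a few candidate prompt templates against labeled examples and keep the best one. This is a generic evaluation loop, not the Poetiq Meta-System, whose details are not public. The function names, the templates, and the `stub_model` stand-in are all hypothetical.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function from prompt text to answer text.
Model = Callable[[str], str]
Example = Tuple[str, str]  # (question, expected answer)


def accuracy(model: Model, template: str, examples: List[Example]) -> float:
    """Fraction of examples the model answers exactly right under this template."""
    correct = sum(model(template.format(question=q)).strip() == a for q, a in examples)
    return correct / len(examples)


def pick_best_template(
    model: Model, templates: List[str], examples: List[Example]
) -> Tuple[str, float]:
    """Score every candidate template and return the best one with its accuracy."""
    scored = [(t, accuracy(model, t, examples)) for t in templates]
    return max(scored, key=lambda pair: pair[1])


# Stand-in model for illustration only: it adds two integers, and answers tersely
# only when the prompt asks for "only the number".
def stub_model(prompt: str) -> str:
    question = prompt.rsplit("Q: ", 1)[-1]
    left, right = question.split(" + ")
    total = str(int(left) + int(right))
    return total if "only the number" in prompt else f"The answer is {total}."


examples = [("2 + 2", "4"), ("3 + 5", "8")]
templates = ["Q: {question}", "Reply with only the number.\nQ: {question}"]
best_template, best_score = pick_best_template(stub_model, templates, examples)
```

A real optimizer would also generate new candidates from the failures instead of choosing from a fixed list, but the scoring step would look similar.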

Beyond RL: A New S-Curve

It sounds like this is a completely different paradigm than RL. We went through the S-curve of regular pre-training, then RL when OpenAI released o1, and now this feels like a new one. It sounds special. It sounds like it rhymes a lot with RNNs, which is a whole different paradigm than RL.

It is going to depend on the particular task, the particular type of problem that we are trying to solve, and the underlying models that we are working with. Effectively, you could say that each model or each set of models that we are working with will have its own S-curve. The Poetiq Meta-System itself is also going to have its own S-curve. As the Poetiq Meta-System gets better and as the underlying models get better, you will find that the S-curve that you are dealing with keeps shifting higher and higher until ultimately either you saturate or reach AGI.

Reach AGI, reach superintelligence.

Given it's stilts, you might hit the ceiling first then.

That is the goal. You want to hit the ceiling first.

A lot of the startups that we work with do a bunch of context engineering, and I do it in my spare time too. The thing is, we are sort of tuning it by hand: tuning evals, context stuffing ourselves. What does it even feel like to have a recursively self-improving version of prompt engineering and context engineering?

We do not spend a lot of time looking at the particular data that we are working with. Instead, we are letting the Poetiq Meta-System look at that data. If the meta-system thinks that it needs to put more things into context, it will do more context stuffing or whatever. If it needs to generate a bunch of examples to get better performance, it will do that for you. It was pretty interesting to look at the prompt outputs in particular for ARC-AGI. You can read those and say, "That is not what a human would have written," pretty clearly. There is some unexpected stuff, and it made some really simple examples. One of the examples is actually wrong, but we did not change it.

Automating Prompt Engineering

We said, "This is the thing that it output; we will just leave it be." We do not want to go in and monkey around with things. Historically in machine learning, the rule was you have to know your dataset really well. Now we are kind of outsourcing that to the AI itself, where it is the AI's job to understand the dataset and figure out where the failure modes are and where the robust reasoning strategies are that the agent could use to get better performance.

How much of it is much better prompts, and how much of it is the harness itself, context stuffing, or summarizing in the right way, or reranking in the right way, so that you have some number of mega LLM calls, and then how do you get the most out of each of those calls?

From 5% to 95% Performance

That definitely varies per problem. Our last paper at DeepMind was not doing this recursive self-improving stuff, but we were showing that you could build these harnesses manually to solve really hard problems.

We manually optimized the prompts really hard for these very hard problems. That got us a little bit of the way. In this particular case, on the hardest task we were working on, we got to about 5% performance with Gemini 1.5 Flash. Then, when we added on the reasoning strategies, we went from 5% to 95%. This is typically what we see. Many people are doing some amount of automated prompt optimization. GEPA is this very popular paper, and everybody is implementing that. That will get you some performance improvements, but it is very far from everything that you can get if you actually think about these reasoning strategies that are really going to be written in code rather than in just better prompts.
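One way to picture a reasoning strategy "written in code rather than in just better prompts" is a propose-and-verify loop: the model proposes an answer, ordinary code checks it, and rejected answers are fed back into the next prompt. The sketch below is a generic illustration under that assumption, not the strategy used in the work described above. `propose_and_verify`, `is_prime_above_13`, and `stub_model` are hypothetical names.

```python
from typing import Callable, List, Optional

# Assumption: a "model" is any function from prompt text to answer text.
Model = Callable[[str], str]
Verifier = Callable[[str], bool]


def propose_and_verify(
    model: Model, task: str, verify: Verifier, max_attempts: int = 4
) -> Optional[str]:
    """Ask for an answer, check it in code, and feed failures into the next prompt."""
    rejected: List[str] = []
    for _ in range(max_attempts):
        prompt = task
        if rejected:
            prompt += "\nThese answers were checked and rejected: " + ", ".join(rejected)
        answer = model(prompt).strip()
        if verify(answer):
            return answer
        rejected.append(answer)
    return None  # no verified answer within the attempt budget


def is_prime_above_13(text: str) -> bool:
    """Deterministic check that no prompt wording can talk its way around."""
    if not text.isdigit():
        return False
    n = int(text)
    return n > 13 and all(n % d for d in range(2, int(n ** 0.5) + 1))


# Stand-in model for illustration only: it works through a fixed list of guesses,
# moving on once it sees its earlier guesses in the rejected list.
def stub_model(prompt: str) -> str:
    guesses = ["12", "15", "17"]
    tail = prompt.split("rejected: ")[-1] if "rejected: " in prompt else ""
    tried = sum(1 for g in guesses if g in tail)
    return guesses[min(tried, len(guesses) - 1)]


result = propose_and_verify(stub_model, "Name a prime number greater than 13.", is_prime_above_13)
```

The verifier here is trivial on purpose. The point is that the control flow (retry, feedback, stopping rule) lives in code, so it behaves the same way whichever model sits underneath.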

Early Access & Putting Your Agent on Stilts

So if startups want to use Poetiq to put their agent on stilts, what should they do?

Right now, we have not released anything yet, but if you go to poetiq.ai, there is a button you can click to sign up for early access. If you are a startup or a company that has a really hard problem and you have tried everything you can to make it reliable and robust and you just cannot get all the way there, you need something more, then let us know. We are looking for problems like that. Just tell us what it is that you are working on, and we will reach out. You will be the first to know when we are ready to work with you.

If you are at the top of Humanity's Last Exam, that is pretty big. You are already all the way out there at SOTA, and then the stilts basically let any agentic company become SOTA.

That is the idea. We view the ARC-AGI results and the Humanity's Last Exam results as showing two different capabilities that we have. We can really improve your reasoning, and we can really improve deep knowledge extraction from these models.

Then you are just totally vaccinated against the bitter lesson.

Exactly.

From YC Founder to DeepMind Researcher

Slight change of topic, but something I was curious about. You arrived at Google over a decade ago when they acquired your first YC startup, Apportable. Apportable was porting mobile apps cross-platform, right? It is quite different from recursively self-improving AGI. How did you make that leap? What happened once you got to Google? What made you think that you maybe wanted to shift out and do something different?

I would love to hear that story.

The acquisition was this amazing opportunity to reflect on what I really wanted to be doing next. Google is a place where you can do so many different things. I spent some time thinking about where I wanted to go next in my journey. I realized that the problems that I was most excited about were really AI and robotics.

Many of the best people in the world in those fields were at Google at the time. I went and talked to them. They let me come join a new AI robotics team in Google Research. That was an amazing opportunity for me since that was not my background. My background was computer security and then this cross-platform mobile systems building stuff. I was able to join this team. I will tell you the truth: I very quickly realized that hardware is hard, and I did not really want to be doing robotics. Robotics was more aspirational at that moment, but I was really passionate about machine learning. So I made a very hard switch into just doing machine learning research and did that for about a decade at Google and then DeepMind.

What is maybe some advice that you have today for engineers who want to get into more of the AI side, probably the applied AI, and build startups around AI? How should they think about that?

The world is changing so quickly. This is probably a little bit obvious, but you should just try things and every day do something with AI. Always try to push yourself to find the boundaries of what these models are capable of, and build the things that you want to build. Even for me, last summer I took a weekend and used GPT-5 to help me build an iPhone app. I had not done that in a decade.

Advice for Engineers in the AI Era

It is so fast and so easy. That was an age ago, or eight months ago. Now it is even faster and easier.

Do not limit yourself. Anything that you imagine, you should just try to use AI and see how far you can get with it, and you will be making the world better.

That is all we have time for today, but Ian, thank you so much for giving us all stilts. We cannot wait to use it at YC. I cannot wait to use it for Garry's List. There is just so much to do.

Thank you for having me. This was a lot of fun.