
The $1B AI company training ChatGPT, Claude & Gemini on the path to responsible



Lenny's Podcast

https://www.youtube.com/watch?v=dduQeaqmpnI


Section 01 / 10


Lenny Rachitsky: You guys hit a billion in revenue in less than four years with around 60 to 70 people. You're completely bootstrapped, haven't raised any VC money. I don't believe anyone has ever done this before.

Edwin Chen: We basically never wanted to play the Silicon Valley game. I always thought it was ridiculous. I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of the people and we would move faster because the best people wouldn't have all these distractions. So when we started Surge, we wanted to build it completely differently with a super small, super elite team.

Lenny Rachitsky: You guys are by far the most successful data company out there.

Edwin Chen: We essentially teach AI models what's good and what's bad. People don't understand what quality even means in this space. They think you could just throw bodies at a problem and get good data, that's completely wrong.

Lenny Rachitsky: To a regular person, it doesn't feel like these models are getting that much smarter constantly.

Edwin Chen: Over the past year, I've realized that the values that the companies have will shape the model. I was asking Claude to help me draft an email the other day. And after 30 minutes, yeah, I think it really crafted me the perfect email and I sent it. But then I realized that I spent 30 minutes doing something that didn't matter at all. If you could choose the perfect model behavior, which model would you want? Do you want a model that says, "You're absolutely right. There are definitely 20 more ways to improve this email," and it continues for 50 more iterations or do you want a model that's optimizing for your time and productivity and just says, "No. You need to stop. Your email's great. Just send it and move on"?

Lenny Rachitsky: You have this hot take that a lot of these labs are pushing AGI in the wrong direction.

Edwin Chen: I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, we are optimizing for AI slop instead. We're optimizing our models for the types of people who buy tabloids at a grocery store. We're basically teaching our models to chase dopamine instead of truth.

Lenny Rachitsky: Today, my guest is Edwin Chen, founder and CEO of Surge AI. Edwin is an extraordinary CEO and Surge is an extraordinary company. They're the leading AI data company, powering training at every frontier AI lab. They're also the fastest company to ever hit $1 billion in revenue in just four years after launch with fewer than 100 people and also completely bootstrapped. They've never raised a dollar in VC money, they've also been profitable from day one.

My podcast guests and I love talking about craft, and taste, and agency, and product market fit. You know what we don't love talking about? SOC 2. That's where Vanta comes in. Vanta helps companies of all sizes get compliant fast and stay that way with industry-leading AI, automation, and continuous monitoring. Whether you're a startup tackling your first SOC 2 or ISO 27001 or an enterprise managing vendor risk, Vanta's trust management platform makes it quicker, easier, and more scalable. Vanta also helps you complete security questionnaires up to five times faster so that you can win bigger deals sooner.
The result, according to a recent IDC study: Vanta customers slash over $500,000 a year in costs and are three times more productive. Establishing trust isn't optional. Vanta makes it automatic. Get $1,000 off at vanta.com/lenny.
Here's a puzzle for you. What do OpenAI, Cursor, Perplexity, Vercel, Plaid, and hundreds of other winning companies have in common? The answer is they're all powered by today's sponsor, WorkOS. If you're building software for enterprises, you've probably felt the pain of integrating single sign-on, SCIM, RBAC, audit logs, and other features required by big customers. WorkOS turns those deal blockers into drop-in APIs with a modern developer platform built specifically for B2B SaaS.
Whether you're a seed stage startup trying to land your first enterprise customer or a unicorn expanding globally, WorkOS is the fastest path to becoming enterprise-ready and unlocking growth. They're essentially Stripe for enterprise features.
Visit workos.com to get started or just hit up their Slack support where they have real engineers in there who answer your questions superfast. WorkOS allows you to build like the best with delightful APIs, comprehensive docs, and a smooth developer experience. Go to workos.com to make your app enterprise ready today.
Edwin, thank you so much for being here and welcome to the podcast.

Edwin Chen: Thanks so much for having me. I'm super excited.

Lenny Rachitsky: I want to start with just how absurd what you've achieved is. A lot of people and a lot of companies talk about scaling massive businesses with very few people as a result of AI, and you guys have done this in a way that is unprecedented. You guys hit a billion in revenue in less than four years with around 60 to 70 people, you're completely bootstrapped, haven't raised any VC money, I don't believe anyone has ever done this before, so you guys are actually achieving the dream of what people are describing will happen with AI. I'm curious just, do you think this will happen more and more as a result of AI? And also just where has AI most helped you find leverage to be able to do this?

Edwin Chen: Yeah, so we hit over a billion in revenue last year with under 100 people. And I think we're going to see companies with even crazier ratios, like 100 billion per employee in the next few years. AI is just going to get better and better and make things more efficient so that ratio just becomes inevitable.

I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of people and we would move faster because the best people wouldn't have all these distractions. And so when we started Surge, we wanted to build it completely differently with a super small, super elite team, and yeah, what's crazy is that we actually succeeded. And so I think two things are colliding.
One is that people are realizing that you don't have to build giant organizations in order to win.
And two, yeah, all these efficiencies from AI. And they're just going to lead to a really amazing time in company building.
The thing I'm excited about is that the types of companies are going to change too. It won't just be that they're smaller, we're going to see fundamentally different companies emerging. If you think about it, fewer employees means less capital. Less capital means you don't need to raise. So instead of companies started by founders who are great at pitching and great at hyping, you'll get founders who are really great at technology and product.
And instead of products optimized for revenue and what VCs want to see, you'll get more interesting ones built by these tiny obsessed teams. So people building things they actually care about, real technology and real innovation. So I'm actually really hoping that Silicon Valley will go back to being a place for hackers again.

Lenny Rachitsky: You guys have done a lot of things in a very contrarian way, and one was actually just not being on LinkedIn posting viral posts, not on Twitter constantly promoting Surge. I think most people hadn't heard of Surge until just recently, and then you just came out, and like, okay, the fastest growing company at a billion dollars. Why would you do that? I imagine that was very intentional.

Edwin Chen: We basically never wanted to play the Silicon Valley game. And like I always thought it was ridiculous. What did you dream of doing when you were a kid? Was it building a company from scratch yourself and getting in the weeds of your code and your product every day? Or was it explaining all your decisions to VCs and getting on this giant PR and fundraising hamster wheel? And it definitely made things more difficult for us, because yeah, when you fundraise, you just naturally become part of this kind of Silicon Valley industrial complex where your VCs will tweet about you. You'll get the TechCrunch headlines, you'll get announced in all of the newspapers because you raised at this massive valuation. And so it made things more difficult for us because the only way we were going to succeed was by building a 10 times better product and getting word of mouth from researchers. But I think it also meant that our customers were people who really understood data and really cared about it.

I always thought it was really important for us to have early customers who were really aligned with what we were building, and who really cared about having really high quality data, and really understood how that data would make their AI models so much better because they were the ones helping us. They were the ones giving us feedback on what we're producing. And so just having that kind of very close mission alignment with our customers actually helped us early on. So these are people who were basically just buying our product because they knew how different it was and because it was helping them rather than because they saw something in the press. So it made things harder for us, but I think in a really good way.

Lenny Rachitsky: It's such an empowering story to hear this journey for founders that they don't need to be on Twitter all day promoting what they're doing. They don't have to raise money. They can just kind of go heads down and build, so I love so much about the story of Surge. For people that don't know what Surge does, just give us a quick explanation of what Surge is.

Edwin Chen: We essentially teach AI models what's good and what's bad. So we train them using human data, and there's a lot of different products that we have, like SFT, RLHF, rubrics, verifiers, RL environments, and so on and so on, and then we also measure how well they're progressing. So essentially we're a data company.

Lenny Rachitsky: What you always talk about is that quality has been the big reason you guys have been so successful, the quality of the data. What does it take to create higher quality data? What do you all do differently? What are people missing?

Edwin Chen: I think most people don't understand what quality even means in this space. They think you could just throw bodies at a problem and get good data and that's completely wrong. Let me give you an example.

So imagine you wanted to train a model to write an eight line poem about the moon. What makes it a good, high-quality poem? If you don't think deeply about quality, you'll be like, "Is this a poem? Does it contain eight lines? Does it contain the word, moon?" You check all of these boxes, and if so, sure. Yeah, you say it's a great poem. But that's completely different from what we want. We are looking for Nobel Prize-winning poetry. Is this poetry unique? Is it full of subtle imagery? Does it surprise you and target your heart? Does it teach you something about the nature of moonlight? Does it play with your emotions? And does it make you think? That's what we are thinking about when we think about a high-quality poem.
So it might be like a haiku about moonlight on water. It might use internal rhyme and meter. There are a thousand ways to write a poem about the moon, and each one gives you all these different insights into language, and imagery, and human expression, and I think thinking about quality in this way is really hard, it's hard to measure. It's really subjective, and complex, and rich. And it sets a really high bar. And so we have to build all of this technology in order to measure it, like thousands of signals on all of our workers, thousands of signals on every project, every task. We know at the end of the day, if you are good at writing poetry versus good at writing essays versus great at writing technical documentation. And so we have to gather all these signals on what your background is, what your expertise is, and not just that. Like how you're actually performing when you're writing all these things, and we use those signals to inform whether or not you are good for these projects, and whether or not you are improving the models.
And it's really hard to build all this technology to measure it, but I think that's exactly what we want AI to do, and so we have these really deep notions about quality that we're always trying to achieve.

Lenny Rachitsky: So what I'm hearing is it's kind of about going much deeper in understanding what quality is within the verticals that you are selling data around. And is this like a person you hire that is incredibly talented at poetry plus evals that they, I guess, help write, that tell them that this is great? What's the mechanics of that?

Edwin Chen: The way it works is we essentially gather thousands of signals about everything that you're doing when you're working on the platform. So we are looking at your keyboard strokes. We are looking at how fast you answer things. We are using reviews, we are using code standards, we are using... We're training models ourselves all on the outputs that you create, and then we're seeing whether they improve the model's performance.

And so in a very similar way to how Google search, like when Google search is trying to determine what is a good webpage, there's almost two aspects of it. One is you want to remove all of the worst of the worst webpages. So you want to remove all the spam, all the just low quality content, all the pages that don't load, and so it's almost like a content moderation problem. You just want to remove the worst of the worst.
But then you also want to discover the best of the best. Okay, like this is the best webpage or just the best person for this job. They are not just somebody who writes the equivalent of high school level poetry. Again, they're not just writing poetry that checks all these boxes, checks all of these explicit instructions, but rather, yeah, they're writing poetry that makes you emotional. And so we have all these signals as well that, completely differently from removing the worst of the worst, we are finding the best of the best. And so we have all these signals...
Again, just like Google Search uses all these signals, feeds them into their ML algorithms, and uses them to predict certain types of things, we do the same with all of our workers and all of our tasks in all of our projects. And so it's almost like a complicated machine learning problem at the end of the day, and that's how it works.

Lenny Rachitsky: That is incredibly interesting.

I want to ask you about something I've been very curious about over the past couple years. If you look at Claude, it's been so much better at coding and at writing than any other model for so long. And it's really surprising just how long it took other companies to catch up, considering just how much economic value there is there. Just like every AI coding product sat on top of Claude because it was so good at code, and at writing also. What is it that made it so much better? Is it just the quality of the data they trained on or is there something else?

Edwin Chen: I think there are multiple parts to it. So a big part of it certainly is the data. I think people don't realize that there's almost like this infinite amount of choices that all the frontier labs are deciding between when they're choosing what data goes into their models. It's like, okay, are you purely using human data? Are you gathering the human data in X, Y, Z way? When you are gathering the human data, what exactly are you asking the people who are creating it to create for you?

For example, in the coding realm, maybe you care more about front end coding versus back end coding. Maybe when you're doing front end coding, you care a lot about the visual design of the front end applications that you're creating, or maybe you don't care about it so much and you care more about, I don't know, the efficiency of it or the pure correctness over that visual design.
And then other questions like, okay, how much synthetic data are you throwing into the mix? How much do you care about these 20 different benchmarks?
Some companies, they see these benchmarks and they're like, "Okay, for PR purposes, even though we don't think that these academic benchmarks matter all that much, maybe we just need to optimize for them anyways because our marketing team needs to show certain progress on certain standard evaluations that every other company talks about, and if we don't show good performance here, it's going to be bad for us even if ignoring these academic benchmarks makes us better at the real tasks."
Other companies are going to be principled and be like, "Okay, yeah, no, I don't care about marketing. I just care about how my model performs on these real world tasks at the end of the day, and so I'm going to optimize for that instead."
And it's almost like there's a trade-off between all of these different things, and there's like a...
One of the things I often think about is that there's a... It's almost like there's an art to post training. It's not purely a science. When you are deciding what kind of model you're trying to create and what it's good at, there's this notion of taste and sophistication, like, "Okay, do I think that these..."
So going back to the example of how good the model is at visual design. I'm like, "Okay, maybe you have a different notion of visual design than what I do. Maybe you care more about minimalism, and you care more about, I don't know, 3D animations than I do. And maybe this other person prefers things that look a little bit more baroque." And there's all these notions of taste and sophistication that you have to decide between when you're designing your post training mix, and so that matters as well.
So long story short, I think there's all these different factors, and certainly the data is a big part of it, but it's also like what is the objective function that you're trying to optimize your model towards?

Lenny Rachitsky: That is so interesting. The taste of the person leading this work will inform what data they ask for, what data they feed it. But it's wild, it shows the value of great data. Anthropic got so much growth and so many wins from essentially better data.

Edwin Chen: Yeah, exactly.

Lenny Rachitsky: And I could see why companies like yours are growing so fast. There's just so much... And that's just one vertical. That's just coding, and then there's probably a similar area for writing. I love that it's... It's interesting that AI, it feels like this artificial computer binary thing, but it's like taste. Human judgment is still such a key factor in these things being successful.

Edwin Chen: Yep, exactly. Again, going back to the example I said earlier, certain companies, if you ask them what a good poem is, they will simply robotically check off all of these instructions on a list.

But again, I don't think that makes for good poetry, so certain frontier labs, the ones with more taste and sophistication, they will realize that it doesn't reduce to this fixed set of checkboxes and they'll consider all of these kind of implicit, very subtle qualities instead, and I think that's what makes them better at this at the end of the day.

Lenny Rachitsky: You mentioned benchmarks. This is something a lot of people worry about: there's all these models that are always... Basically, it feels like every model is better than humans at every STEM field at this point, but to a regular person, it doesn't feel like these models are getting that much smarter constantly. What's your just sense of how much you trust benchmarks and just how correlated those are with actual AI advancements?

Edwin Chen: Yeah, so I don't trust the benchmarks at all. And I think that's for two reasons. So one is I think a lot of people don't realize, even researchers within the community, they don't realize that the benchmarks themselves are often honestly just wrong. They have wrong answers. They're full of all this kind of messiness and people trust... At least for the popular ones, people have maybe realized this to some extent, but the vast majority just have all these flaws that people don't realize. So that's one part of it.

And the other part of it is these benchmarks at the end of the day, they often have well-defined objective answers that make them very easy for models to hill-climb on in a way that's very different from the messiness and ambiguity of the real world.
I think one thing that I often say is that it's kind of crazy that these models can win IMO gold medals, but they still have trouble parsing PDFs. And that's because, yeah, even though IMO gold medals seem hard to the average person, yeah, they are hard at the end of the day. But they have this notion of objectivity that, okay, yeah, parsing a PDF sometimes doesn't have. And so it's easier for the frontier labs to hill-climb on all of these than to solve all these messy, ambiguous problems in the real world. So I think there's a lack of direct correlation there.

Lenny Rachitsky: It's so interesting, the way you described it, hitting these benchmarks is kind of like a marketing piece. When you launch, say Gemini 3 just launched, and it's like, cool, number one on all these benchmarks. Is that what happens? They just kind of train their models to get good at these very specific things?

Edwin Chen: Yeah, so there's, again, maybe two parts to this. So one is, sometimes, yeah, these benchmarks, they accidentally leak in certain ways or the frontier labs will tweak the way they evaluate their models on these benchmarks. They'll tweak the system prompt or they'll tweak the number of times they run their model, and so on and so on in a way that games these benchmarks.

The other part of it though is it's like by optimizing for the benchmark instead of optimizing for the real world, you will just naturally climb on the benchmark and, yeah, it's basically another form of gaming it.

Lenny Rachitsky: With that in mind, how do you get a sense of whether we're heading towards AGI? How do you measure progress?

Edwin Chen: Yes, so the way we really care about measuring model progress is by running all these human evaluations.

So for example, what we do is, yeah, we will take our human annotators, and we'll ask them, "Okay, go have a conversation with the model." And maybe you're having this conversation with the model across all of these different topics. So you are a Nobel Prize winning physicist. So you go have a conversation about pushing the frontier of your own research. You are a teacher and you're trying to create lesson plans for your students, so go talk to the model about these things. Or you're a coder and you're working at one of these big tech companies, and you have these problems every day, so go talk to the model and see how much it helps you.
And because our annotators, they are experts at the top of their fields, and they are not just skimming responses, they're actually working through the responses deeply themselves, they are... Yeah, they're going to evaluate the code that it writes. They're going to double-check the physics equations that it writes. They're going to evaluate the models in a very deep way, so they're going to pay attention to accuracy and instruction following, all these things that casual users don't when you suddenly get a popup on your ChatGPT response asking you to compare these two different responses. People like that, they're not evaluating models deeply, they're just vibing and picking whatever response looks flashiest, whereas our annotators are looking closely at responses and evaluating them across all of these different dimensions, and so I think that's a much better approach than these benchmarks or these random online A/B tests.

Lenny Rachitsky: Again, I love just how central humans continue to be in all this work, that we're not totally done yet. Is there going to be a point where we don't need these people anymore, that AI is so smart that, "Okay, we're good. We got everything out of your heads"?

Edwin Chen: Yeah, I think that will not happen until we've reached AGI. It's almost like by definition, if we haven't reached AGI yet, then there's more for the models to learn from, and so, yeah, I don't think that's going to happen anytime soon.

Lenny Rachitsky: Okay, cool. So more reason to stress about AGI. "We don't need these folks anymore."

I can't not ask just... People that work closely with this stuff, I'm always just curious. What's your AGI timelines? How far do you think we are from this? Do you think we're in like a couple years or is it like decades?

Edwin Chen: So I'm certainly on the longer time horizon front. I think people don't realize that there's a big difference between moving from 80% performance to 90% performance to 99% performance to 99.9% performance, and so on, and so on. And so in my head, I probably bet that within the next one or two years, yeah, the models are going to automate 80% of the average L6 software engineer's job. It's going to take another few years to move to 90%, and another few to 99%, and so on, and so on. So I think we're closer to a decade or decades away than a couple of years.

Lenny Rachitsky: You have this hot take that a lot of these labs are kind of pushing AGI in the wrong direction and this is based on your work at Twitter, and Google, and Facebook. Can you just talk about that?

Edwin Chen: I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead. We're basically teaching our models to chase dopamine instead of truth. And I think this relates to what we're talking about regarding these benchmarks. So let me give you a couple examples.

So right now, the industry is plagued by these terrible leaderboards like LMArena. It's this popular online leaderboard where random people from around the world vote on which AI response is better. But the thing is, like I was saying earlier, they're not carefully reading or fact-checking. They're skimming these responses for two seconds and picking whatever looks flashiest.
So a model can hallucinate everything. It can completely hallucinate. But it will look impressive because it has crazy emojis, and bolding, and markdown headers, and all these superficial things that don't matter at all, but they catch your attention. And these LMArena users love it. It's literally optimizing our models for the types of people who buy tabloids at the grocery store. We've seen this data ourselves. The easiest way to climb LMArena is adding crazy bolding. It's doubling the number of emojis. It's tripling the length of your model responses, even if your model starts hallucinating and getting the answer completely wrong.
And the problem is, again, because all of these frontier labs, they kind of have to pay attention to PR because their sales team, when they're trying to sell to all these enterprise customers, those enterprise customers will say, "Oh, well, but your model's only number five on LMArena, so why should I buy it?" They have to, in some sense, pay attention to these leaderboards, and so their researchers will tell us, "The only way I'm going to get promoted at the end of the year is if I climb this leaderboard, even though I know that climbing it is probably going to make my model worse at accuracy and instruction following." So I think there's all these negative incentives that are pushing work in the wrong direction.
I'm also worried about this trend towards optimizing AI for engagement. I used to work on social media. And every time we optimized for engagement, terrible things happened. You'd get clickbait and pictures of bikinis and bigfoot and horrifying skin diseases just filling your feeds. And I worry that the same thing's happening with AI. If you think about all the sycophancy issues with ChatGPT, "Oh, you're absolutely right. What an amazing question," the easiest way to hook users is to tell them how amazing they are. And so these models, they constantly tell you you're a genius. They'll feed into your delusions and conspiracy theories. They'll pull you down these rabbit holes because Silicon Valley loves maximizing time spent and just increasing the number of conversations you're having with it. And so yeah, companies are spending all this time hacking these leaderboards and benchmarks, and the scores are going up, but I think it actually masks that the models with the best scores are often the worst or just have all these fundamental failures. So I'm really worried that all of these negative incentives are pushing AGI in the wrong direction.

Lenny Rachitsky: So what I'm hearing is AGI is being slowed down by basically the wrong objective function, these labs paying attention to basically the wrong benchmarks and evals.

Edwin Chen: Yep.

Lenny Rachitsky: I know you probably can't play favorites since you work with all the labs. Is there anyone doing better at this and maybe kind of realizing this is the wrong direction?

Edwin Chen: I would say I've always been very impressed by Anthropic. I think Anthropic takes a very principled view about what they do and don't care about and how they want their models to behave, in a way that feels a lot more principled to me.

Lenny Rachitsky: Interesting.

Are there any other big mistakes you think labs are making just that are kind of slowing things down or heading in the wrong direction? Where we've heard just chasing benchmarks, this engagement focus, is there anything else you're seeing of just like, "Okay, we got to work on this because it'll speed everything up"?

Edwin Chen: I think there is a question of what products they're building and whether those products themselves are something that kind of help or hurt humanity. I think a lot about Sora and...

Lenny Rachitsky: I was thinking that's what you were imagining.

Edwin Chen: Yeah, what it entails, and so it's kind of interesting. It's like which companies would build Sora and which wouldn't?

And I think the answer to that... Well, I don't know the answer myself. I have an idea in my head, but I think the answer to that question maybe reveals certain things about what kinds of AI models those companies want to build and what direction and what future they want to achieve, yeah, so I think about that a lot.

Lenny Rachitsky: The steel man argument there is, it's like fun, people want it, it'll help them generate revenue to grow this thing and build better models, it'll train data in an interesting way, it's also just really fun.

Edwin Chen: Yeah. I think it's almost like, do you care about how you get there? And in the same way, so I made this tabloid analogy earlier, but would you sell tabloids in order to fund, I don't know, some other newspaper?

Sure, like in some sense, if you don't care about the path, then you'll just do whatever it takes, but it's possible that it has negative consequences in and of itself that will harm the long-term direction of what you're trying to achieve, and maybe it'll distract you from all the more important things, so yeah, I think that the path you take matters a lot as well.

Lenny Rachitsky: Along these lines, you talked a bunch about this, of just Silicon Valley and kind of the downsides of raising a lot of money, being in the echo chamber. What do you call it, the Silicon Valley machine? You talk about how it's hard to build important companies in this way and that you might actually be much more successful if you're not going down the VC path. Can you just talk about what you've seen in that experience and your advice essentially to founders? Because they're always hearing: raise money from fancy VCs, move to Silicon Valley. What's kind of the countertake?

Edwin ChenYes. So I've always really hated a lot of the Silicon Valley mantras. The standard playbook is to get product market fit by pivoting every two weeks. And to chase growth and chase engagement with all of these dark patterns and to blitz scale by hiring as fast as possible. And I've always disagreed.

So yeah, I would say don't pivot. Don't blitzscale. Don't hire that Stanford grad who simply wants to add a hot company to their resume. Just build the one thing only you could build, a thing that wouldn't exist without the insight and expertise that only you have.
And you see these pivot companies everywhere now. Some founder who was doing crypto in 2020, then pivoted to NFTs in 2022, and now they're an AI company. There's no consistency, there's no mission, they're just chasing valuations. And I've always hated this, because Silicon Valley loves to scoff at Wall Street for focusing on money, but honestly, most of Silicon Valley is chasing the same thing. And so we stayed focused on our mission from day one, pushing that frontier of high-quality, complex data, and I've always loved that, because I think startups...
I have this very romantic notion of startups. Startups are supposed to be a way of taking big risks to build something that you really believe in. But if you're constantly pivoting, you're not taking any risks. You're just trying to make a quick buck. And if you fail because the market isn't ready yet, I actually think that's way better. At least you took a swing at something deep, and novel, and hard instead of pivoting into another LLM wrapper company. So yeah, I think the only way you build something that matters that's going to change the world is if you find a big idea you believe in and you say no to everything else.
So you don't keep on pivoting when it gets hard, you don't hire a team of 10 product managers because that's what every other cookie-cutter startup does, you just keep building that one company that wouldn't exist without you. And I think there are a lot of people in Silicon Valley now who are sick of all the grift, who want to work on big things that matter with people who actually care, and I'm hoping that that will be the future of where we go with technology.

Lenny RachitskyI'm actually working on a post right now with Terrence Rohan, this VC that I really like working with, and we interviewed five people who picked really successful generational companies early and joined them as really early employees. They joined OpenAI before anyone thought it was awesome, Stripe before anyone knew it was awesome. So we're looking for patterns in how people find these generational companies before anyone else, and it aligns exactly with what you described, which is ambition. They have a wild ambition for what they want to achieve. They're not, as you said, just looking around for product market fit no matter what it ends up being. So I love that what you described very much aligns with what we're seeing there.

Edwin ChenYeah, I absolutely think that you have to have huge ambitions, and you have to have a huge belief in your idea that's going to change the world, and you have to be willing to double down and keep on doing whatever it takes to make it happen.

Lenny RachitskyI love how counter your narrative is to so many of the things people hear, and so I love that we're doing this. I love that we're sharing this story.

Imagine starting a project at work. Your vision is clear, you know exactly who's doing what, and where to find the data you need to do your part. In fact, you don't have to waste time searching for anything, because everything your team needs, from project trackers and OKRs to documents and spreadsheets, lives in one tab, all in Coda.
With Coda's collaborative all in one workspace, you get the flexibility of docs, the structure of spreadsheets, the power of applications, and the intelligence of AI all in one easy to organize tab. Like I mentioned earlier, I use Coda every single day. And more than 50,000 teams trust Coda to keep them more aligned and focused. If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time.
To try it for yourself, go to coda.io/lenny today and get six months free of the team plan for startups. That's coda.io/lenny to get started for free and get six months of the team plan: coda.io/lenny.
Slightly different direction, but something else that's maybe a counter-narrative. I imagine you watched the Dwarkesh and Richard Sutton podcast episode, but even if you didn't, they basically had this conversation. Richard Sutton is a famous AI researcher, known for the whole "bitter lesson" idea, and he talked about how LLMs are almost a kind of dead end, and he thinks we're going to really plateau with LLMs because of the way they learn.
What's your take there? Do you think LLMs will get us to AGI or beyond, or do you think there's going to be something new or a big breakthrough that needs to get us there?

Edwin ChenI'm in the camp where I do believe that something new will be needed. The way I think about it is, when I think about training AI, I take a very... I don't know if I would say biological point of view. But I believe that in the same way there are a million different ways that humans learn, we need to build models that can mimic all of those ways as well. Maybe they'll have a different distribution of focuses than humans do, but we want to mimic the learning abilities of humans and make sure we have the algorithms and the data for models to learn in the same way. And so to the extent that LLMs learn in different ways than humans do, then yeah, I think something new will be needed.

Lenny RachitskyThis connects to reinforcement learning. This is something that you're big on and something I'm hearing more and more is just becoming a big deal in the world of post-training. Can you just help people understand what is reinforcement learning and reinforcement learning environments, and why they're going to be more and more important in the future?

Edwin ChenReinforcement learning is essentially training your model to reach a certain reward. And let me explain what an RL environment is. An RL environment is essentially a simulation of the real world. So think of it like building a video game with a fully fleshed-out universe. Every character has a real story, every business has tools and data you can call, and you have all these different entities interacting with each other.

So for example, we might build a world where you have a startup with Gmail messages, and Slack threads, and Jira tickets, and GitHub PRs, and a whole code base. And then suddenly AWS goes down. And Slack goes down. And so, "Okay. Model, well, what do you do?" The model needs to figure it out.
So we give the models tasks in these environments, we design interesting challenges for them, and then we run them to see how they perform. And then we teach them: we give them rewards depending on whether they're doing a good job or a bad job.
And I think one of the interesting things is that these environments really showcase where models are weak at end-to-end tasks in the real world. You have all these models that seem really smart on isolated benchmarks. They're good at single-step tool calling. They're good at single-step instruction following. But suddenly you dump them into these messy worlds where you have confusing Slack messages and tools they've never seen before, and they need to take the right actions and interact over longer time horizons, where what they do in step one affects what they do in step 50. And that's very different from the academic single-step environments they've been in before, and so the models just fail catastrophically in all these crazy ways.
So I think these RL environments are going to be really interesting playgrounds for the models to learn from, essentially simulations that mimic the real world, and so the models will hopefully get better and better at real tasks compared to all these contrived environments.
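The loop Edwin describes, a simulated world, an agent acting in it, and a reward signal, can be sketched in miniature. This is a toy illustration only: the `IncidentEnv` class, its actions, and its reward values are invented for this example and are not Surge's actual tooling.

```python
# Toy RL environment: an agent must notice a service is down and fix it.
# All names and reward values here are hypothetical.

class IncidentEnv:
    def __init__(self):
        self.reset()

    def reset(self):
        # World state: one service is down, the alert is unread.
        self.state = {"service_up": False, "alert_read": False}
        self.steps = 0
        return self.state

    def step(self, action):
        """Apply an action; return (state, reward, done)."""
        self.steps += 1
        reward = 0.0
        if action == "read_slack":           # gather context first
            self.state["alert_read"] = True
        elif action == "restart_service":
            self.state["service_up"] = True
            # An informed fix earns full reward; a blind fix earns less.
            reward = 1.0 if self.state["alert_read"] else 0.3
        done = self.state["service_up"] or self.steps >= 10
        return self.state, reward, done

env = IncidentEnv()
env.reset()
_, r1, _ = env.step("read_slack")        # no reward yet, just context
_, r2, done = env.step("restart_service")  # informed fix: reward 1.0
```

The real environments described above would have far richer state (Slack threads, Jira tickets, code bases) and tool calls, but the training signal has this same shape: act, observe, get rewarded.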

Lenny RachitskySo I'm trying to imagine what this looks like. Essentially, it's like a virtual machine with, I don't know, a browser or a spreadsheet or something in it with like, I don't know, surge.com. Is that your website, surge.com? Let's make sure we get that right.

Edwin ChenSo we are actually surgehq.ai.

Lenny RachitskySurgehq.ai. Check it out. You're hiring, I imagine. Yes. Okay. So it's like, cool, here's surgehq.ai. Here's your job as an agent, let's say: make sure it stays up. And then all of a sudden it goes down, and the objective function is figure out why. Is that an example?

Edwin ChenYeah, so the objective function might be... or the goal of the task might be, okay, go figure out why and fix it. And the objective function might be passing a series of unit tests, or it might be writing a document, maybe a retro containing certain information that matches exactly what happened. There are all these different rewards we might give it that determine whether or not it's succeeding, and so we're basically teaching the models to achieve that reward.

Lenny RachitskySo essentially it's off and running. Here's your goal: figure out why the site went down and fix it. And it just starts trying stuff, using all the intelligence it's got. It makes mistakes, you help it along the way, reward it if it's doing the right sort of thing. And so what you're describing here is the next phase of models becoming smarter: more RL environments focused on very specific tasks that are economically valuable, I imagine.

Edwin ChenYeah. So just in the same way that there were all these different methods for models to learn in the past, originally we had SFT and RLHF, and then we had rubrics and verifiers. This is the next stage. And it's not the case that the previous methods are obsolete; this is, again, just a different form of learning that complements all the previous types. It's just a different skill for the model to learn.

Lenny RachitskyAnd so in this case, it's less some physics PhD sitting around talking to a model, correcting it, giving it evals of here's what the correct answer is, creating rubrics and things like that. It's more like this person is now designing an environment. So another example I've heard is a financial analyst. It's like, "Here's an Excel spreadsheet, here's your goal: figure out our profit and loss," or whatever. And so this expert now, instead of just sitting around writing rubrics, is designing this RL environment.

Edwin ChenYeah, exactly. So that financial analyst might create a spreadsheet, and they may create certain tools that the model needs to call in order to help fill out that spreadsheet. It might be, okay, the model needs to access a Bloomberg terminal, and it needs to learn how to use it. And it needs to learn how to use this calculator, and it needs to learn how to perform this calculation. So it has all these tools that it has access to.

And then the reward might be, okay, maybe I'll download that spreadsheet and I want to see: does cell B22 contain the correct profit and loss number? Or does tab number two contain this piece of information?
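A cell-checking reward like the one just described could look something like this sketch. The workbook is modeled as a plain dict for simplicity (a real grader would read the actual .xlsx file), and the function name, expected values, and graded cells are all hypothetical.

```python
# Hypothetical verifier for the spreadsheet task: after the agent finishes,
# inspect its workbook and check that specific cells hold expected values.

EXPECTED_PNL = 1_250_000.0  # made-up ground-truth profit-and-loss figure

def grade_workbook(workbook):
    """Return a reward in [0, 1]: the fraction of graded cells that pass."""
    checks = [
        # (tab, cell, predicate) -- all illustrative
        ("Summary", "B22", lambda v: abs(v - EXPECTED_PNL) < 1e-6),
        ("Sources", "A1", lambda v: v == "Bloomberg"),
    ]
    passed = 0
    for tab, cell, ok in checks:
        value = workbook.get(tab, {}).get(cell)  # missing cells score zero
        if value is not None and ok(value):
            passed += 1
    return passed / len(checks)

agent_output = {
    "Summary": {"B22": 1_250_000.0},
    "Sources": {"A1": "Bloomberg"},
}
reward = grade_workbook(agent_output)  # 1.0 when every check passes
```

A partial-credit reward like this gives the model a gradient to climb even when it gets only some cells right, rather than an all-or-nothing pass/fail.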

Lenny RachitskyAnd what's interesting, this is a lot closer to how humans learn. We just try stuff, figure out what's working and what's not. You talk about how trajectories are really important to this. It's not just here's the goal and here's the end, it's like every step along the way. Can you just talk about what trajectories are and why that's important to this?

Edwin ChenI think one of the things people don't realize is that sometimes, even though the model reaches the correct answer, it does so in all these crazy ways. In the intermediate trajectory, it may have tried 50 different times and failed, but eventually it just randomly lands on the correct number. Or maybe it is...

Sometimes it just does things very inefficiently, or it almost reward-hacks its way to the correct answer, and so I think paying attention to the trajectory is actually really important. And it's also really important because some of these trajectories can be very long. If all you're doing is checking whether or not the model reaches the final answer, you're missing all this information about how the model behaved in the intermediate steps.
Sometimes you want models to get to the correct answer by reflecting on what they did. Sometimes you want them to get to the correct answer by just one-shotting it. And if you ignore all of that, you're just missing a lot of the information you could be teaching the model.
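One way to make a reward trajectory-aware, as a rough illustration of the idea (the weights, step labels, and function name here are invented, not anything Surge actually uses):

```python
# Hypothetical trajectory-level scorer: correctness gates the reward,
# while the shape of the path (length, failed retries) scales it.

def score_trajectory(steps, final_correct, max_steps=50):
    """steps: list of step labels; returns a reward in [0, 1]."""
    if not final_correct:
        return 0.0                                   # wrong answer: no reward
    failures = sum(1 for s in steps if s == "failed_attempt")
    efficiency = 1.0 - len(steps) / max_steps        # fewer steps is better
    clean = 1.0 - failures / max(len(steps), 1)      # fewer retries is better
    return 0.5 + 0.25 * efficiency + 0.25 * clean

# A one-shot solve outscores a long flail that lucked into the answer,
# even though both reach the correct final result.
one_shot = score_trajectory(["ok"], final_correct=True)
flail = score_trajectory(["failed_attempt"] * 40 + ["ok"], final_correct=True)
```

Under a final-answer-only reward, `one_shot` and `flail` would be indistinguishable; scoring the path is what lets training discourage the 50-random-tries behavior Edwin describes.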

Lenny RachitskyI love that. Yeah, it tries a bunch of stuff and eventually gets it right. You don't want it to learn this is the way to get there. There's often a much more efficient way of doing it.

You mentioned all the steps we've taken along the journey of helping models get smarter. Since you've been so close to this for so long, I think this will be really helpful for people. What have been the steps along the way, from the first post-training methods that most helped models advance? Where do evals fit in relative to RL environments? Just, what have the steps been that got us to where we're now heading towards RL environments?

Edwin ChenOriginally, the way models started getting post-trained was purely through SFT. And-

Lenny RachitskyWhat does that stand for?

Edwin ChenSo SFT stands for supervised fine-tuning. So again, I think often in terms of these human analogies, and so SFT is a lot like mimicking a master and copying what they do.

And then RLHF became very dominant. The analogy there would be: sometimes you learn by writing 55 different essays and someone telling you which one they liked the most.
And then I think over the past year or so, rubrics and verifiers have become very important. And rubrics and verifiers are like learning by being graded and getting detailed feedback on where you went wrong.
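A rubric-and-verifier grader in the spirit Edwin describes might look like this sketch. The criteria, weights, and string checks are all made up for illustration; real rubrics would be far richer and often graded by humans or other models.

```python
# Hypothetical rubric grader: a response is scored against weighted
# criteria, and the per-criterion results double as the "detailed
# feedback on where you went wrong."

RUBRIC = [
    # (criterion, weight, check) -- all illustrative
    ("cites a source",        0.4, lambda r: "http" in r),
    ("stays under 100 words", 0.3, lambda r: len(r.split()) < 100),
    ("ends with a sentence",  0.3, lambda r: r.strip().endswith(".")),
]

def grade(response):
    """Return (score, feedback): weighted total plus pass/fail per criterion."""
    feedback = {name: check(response) for name, _weight, check in RUBRIC}
    score = sum(weight for name, weight, _check in RUBRIC if feedback[name])
    return score, feedback

score, feedback = grade("See http://example.com for details.")
```

Unlike the RLHF analogy (a single "which essay did you like most?" preference), the rubric tells the model which specific criterion it missed, which is exactly the "detailed feedback" distinction drawn above.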

Lenny RachitskyAnd those are evals, another word for that?

Edwin ChenYeah. So I think evals often covers two things. One is using evaluations for training: you're evaluating whether or not the model did a good job, and when it does do a good job, you're rewarding it.

And then there's this other notion of evals where you're trying to measure the model's progress. Like, okay, I have five different candidate checkpoints and I want to pick the one that's best in order to release it to the public. So I'm going to run all these evals on these five different checkpoints in order to decide which one is best.
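The checkpoint-selection flavor of evals can be sketched like this. The checkpoint names, eval names, and scores are all fabricated; in practice each number would come from actually running the eval suite, and the aggregation would likely be more nuanced than a plain average.

```python
# Hypothetical checkpoint selection: run the same eval suite over several
# candidate checkpoints and release the one with the best aggregate score.

def pick_best_checkpoint(eval_scores):
    """eval_scores: {checkpoint_name: {eval_name: score}} -> best checkpoint."""
    def mean(scores):
        return sum(scores.values()) / len(scores)
    return max(eval_scores, key=lambda ckpt: mean(eval_scores[ckpt]))

candidates = {
    "ckpt-a": {"coding": 0.71, "reasoning": 0.64, "safety": 0.90},
    "ckpt-b": {"coding": 0.75, "reasoning": 0.69, "safety": 0.88},
    "ckpt-c": {"coding": 0.68, "reasoning": 0.72, "safety": 0.85},
}
best = pick_best_checkpoint(candidates)  # highest average across the evals
```

This is the "measurement" use of evals: the same benchmark suite, but used to choose between finished models rather than to generate a training reward.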

Lenny RachitskyAwesome.

Edwin ChenYeah, and now we have RL environments, so this is kind of like a hot new thing.

Lenny RachitskyAwesome. So what I love about this business journey is there's always something new. There's always this: okay, we got so good at creating all this beautiful data for companies, and now they need something completely different. Now we're setting up all these virtual machines for them, for all these different use cases.

Edwin ChenYep.

Lenny RachitskyAnd it feels like that's a big part of this industry you're in, it's just adapting to what labs are asking for.

Edwin ChenYeah. So I really do think that we are going to need to build a suite of products that reflect a million different ways that humans learn.

For example, think about becoming a great writer. You don't become great by memorizing a bunch of grammar rules. You become great by reading great books, and you practice writing, and you get feedback from your teachers and from the people who buy your books in a bookstore and leave reviews. And you notice what works and what doesn't. And you develop taste by being exposed to all of these masterpieces and also just terrible writing. So you learn through this endless cycle of practice and reflection, and these are all very different methods of learning to become a great writer. So just in the same way that there are a thousand different ways a great writer becomes great, I think there are going to be a thousand different ways that AIs need to learn.

Lenny RachitskyIt's so interesting that this ends up being just like humans in so many ways. It makes sense, because in a sense, neural networks and deep learning are modeled after how humans learn and how our brains operate. But it's interesting that to make them smarter, the question becomes: how do we come closer and closer to how humans learn?

Edwin ChenYeah, it's almost like maybe the end goal is just throwing you into the environment and just seeing how you evolve. But within that evolution, there's all these different sub-learning mechanisms.

Lenny RachitskyYeah, which is kind of what we're doing now, so that's really interesting. This might be the last step until we hit AGI. Along these lines, something really unique to Surge that I learned is you guys have your own research team, which I think is pretty rare. Talk about why that's something you've invested in and what has come out of that investment.

Edwin ChenYeah, so I think that stems from my own background. My own background is as a researcher. And so I've always cared fundamentally about pushing the industry and pushing the research community and not just about revenue. And so I think what our research team does is a couple different things.

So we almost have two types of researchers at our company. One is our forward-deployed researchers, who are often working hand in hand with our customers to help them understand their models. We'll work very closely with customers to help them understand: "Okay, this is where your model is today. This is where you're lagging behind your competitors. These are some ways you could improve in the future, given your goals, and we're going to design these datasets, these evaluation methods, these training techniques to make your models better." So it's this very collaborative notion of working with our customers, who are researchers themselves, just a bit more focused on the data side, and working hand in hand with them to do whatever it takes to make them the best.
And then we also have our internal researchers. So our internal researchers are focused on slightly different things. So they are focused on building better benchmarks and better leaderboards.
So I've talked a lot about how I worry that the leaderboards and benchmarks out there today are steering models in the wrong direction. So the question is, how do we fix that? That's what our research team is focused really heavily on right now. They're working a lot on that.
And they're also working on other things, like: "Okay, we need to train our own models to see what types of data perform the best, what types of people perform the best." So they're also working on all these training techniques and evaluations of our own datasets, to improve our data operations and the internal data products we have that determine what makes something good quality.

Lenny RachitskyIt's such a cool thing, because basically only the labs have researchers helping them advance AI. I imagine it's pretty rare for a company like yours to have researchers actually doing primary research on AI.

Edwin ChenYeah, I think it's just because it's something I've fundamentally always cared about. I often think about us more like a research lab than a startup, because that is my goal. It's kind of funny, but I've always said I would rather be Terence Tao than Warren Buffett. That notion of creating research that pushes the frontier forward, and not just getting some valuation, has always been what drives me.

Lenny RachitskyAnd it's worked out. That's the beautiful thing about this. You mentioned that you're hiring researchers. Is there anything you want to share about the folks you're looking for?

Edwin ChenSo we look for people who are just fundamentally interested in data. The types of people who could literally spend 10 hours digging through a dataset, playing around with models, and thinking, "Okay, yeah, this is where I think the model's failing, and this is the kind of behavior you want the model to have instead." It's this aspect of being very hands-on and thinking about the qualitative aspects of models, not just the quantitative parts. So again, it's about being hands-on with data and not just caring about abstract algorithms.

Lenny RachitskyAwesome.

I want to ask a couple broad AI kind of market questions. What else do you think is coming in the next couple of years that people are maybe not thinking enough about or not expecting in terms of where AI is heading? What's going to matter?

Edwin ChenI think one of the things that's going to happen in the next few years is that the models are actually going to become increasingly differentiated because of the personalities and behaviors that the different labs have and the kind of objective functions that they are optimizing their models for. I think it's one thing I didn't appreciate a year or so ago.

A year or so ago, I thought that all of the AI models would essentially become very commoditized. They would all behave like each other, and sure, one of them might be slightly more intelligent in one way today, but sure, the other ones would catch up in the next few months. But I think over the past year, I've realized that the values that the companies have will shape the model.
So let me give you an example. I was asking Claude to help me draft an email the other day, and it went through 30 different versions. And after 30 minutes, yeah, I think it really crafted me the perfect email, and I sent it. But then I realized that I'd spent 30 minutes doing something that didn't matter at all. Sure, now I had the perfect email, but I spent 30 minutes on something I wouldn't have worried about at all before, and this email probably didn't even move the needle on anything anyway.
So I think there's a deep question here, which is, if you could choose the perfect model behavior, which model would you want? Do you want a model that says, "You're absolutely right. There are definitely 20 more ways to improve this email," and it continues for 50 more iterations. And it sucks up all your time and engagement. Or do you want a model that's optimizing for your time and productivity and just says, "No, you need to stop. Your email's great. Just send it and move on with your day"?
And again, in the same way that there's a kind of fork in the road in how you choose your model's behavior for this question, for every other question that models face, the kind of behavior you want will fundamentally affect the model.
It's almost like how when Google builds a search engine, it's very different from how Facebook would build a search engine, which is very different from how Apple would build a search engine. They all have their own principles and values and things they're trying to achieve in the world that shape all the products they build. And in the same way, I think all the models will start behaving very differently too.

Lenny RachitskyThat is incredibly interesting. You already see that with Grok. It's got a very different personality and a very different approach to answering questions. And so what I'm hearing is you're going to see more of this differentiation.

Edwin ChenYep.

Lenny RachitskyKind of another question along these lines, what do you think is most under-hyped in AI that you think maybe people aren't talking enough about that is really cool? And what do you think is over-hyped?

Edwin ChenSo I think one of the things that's under-hyped is the built-in products that all of the chatbots are going to start having. I've always been a huge fan of Claude's Artifacts, and I think it just works really well. And actually, the other day, I don't know if it's a new feature or not, but I asked it to help me create an email, and then it just created... It didn't quite work, because it didn't allow me to send the email. But what it created instead was a little, I don't know what to call it, a little box where I could click on it and it would just text someone this message. And I think that concept of taking Artifacts to the next level, where you have these mini apps, mini UIs within the chatbots themselves, I feel like people aren't talking enough about that. So I think that's one under-hyped area.

And in terms of over-hyped areas, I definitely think that vibe coding is over-hyped. I think people don't realize how much it's going to make their systems unmaintainable in the long term. They simply dump this code into their codebases because it seems to work right now, so I kind of worry about the future of coding. It's just going to keep on happening.

Lenny RachitskyThese are amazing answers. On that first point, there's something I actually asked the chief product officers of OpenAI and Anthropic, Kevin Weil and Mike Krieger, on the podcast. I asked them, "As a product team, you have this gigabrain intelligence. How long do you even need product teams?" Do you think this AI will just create the product for you? "Here's what I want." It's like the next level of vibe coding: you just tell it, "Here's what I want," and it's building the product and evolving the product as you're using it. And it feels like that's where you're describing we might be heading.

Edwin ChenYeah, I think there's a very powerful notion where it helps people just achieve their ideas in a much cooler way.

Lenny RachitskySomething we haven't gotten into that I think is really interesting is the story of how you got to starting Surge. You have a really unique background. Brian Armstrong, the founder of Coinbase, once gave a talk that has really stuck with me, where he talked about how his very unique background allowed him to start Coinbase. He had an economics background, he had cryptography experience, and he was an engineer. It's like the perfect Venn diagram for starting Coinbase, and I feel like you have a very similar story with Surge. Talk about that, your background, and how it led to Surge.

Edwin ChenGoing way back, I was always fascinated by math and language when I was a kid. I went to MIT because it's obviously one of the best places for math and CS, but also because it's the home of Noam Chomsky. My dream in school was actually to find some underlying theory connecting all these different fields.

And then I became a researcher at Google, and Facebook, and Twitter, and I just kept running into the same problem over and over again: it was impossible to get the data we needed to train our models. So I was always this huge believer in the need for high-quality data. And then GPT-3 came out in 2020, and I realized that if we wanted to take things to the next level and build models that could code, and use tools, and tell jokes, and write poetry, and cure cancer, then yeah, we were going to need a completely new solution.
The thing that always drove me crazy when I was at all these companies was that we had the full power of the human mind in front of us, and all the data companies out there were focused on really simple things like image labeling. So I wanted to build something focused on all these advanced, complex use cases instead, something that would really help us build our next-generation models. So yeah, my background across math, and computer science, and linguistics really informed what I always wanted to do, and I started Surge a month later with one mission: to build for the use cases that I thought were going to be needed to push the frontier of AI.

Lenny RachitskyAnd you said a month later, a month later after what?

Edwin ChenAfter a GPT-3 launch in 2020.

Lenny RachitskyOh, okay. Wow. Okay. Yeah. A great decision.

What just kind of drives you at this point of... Other than just the epic success you're having, what keeps you motivated to keep building this and building something in this space?

Edwin ChenI think I'm a scientist at heart. I always thought I was going to become this math or CS professor and work on trying to understand the universe, and language, and the nature of communication. It's kind of funny, but I always had this fanciful dream where if aliens ever came to visit Earth and we need to figure out how to communicate with them, I wanted to be the one the government would call. And I'd use all this fancy math, and computer science, and linguistics to decipher it.

So even today, what I love doing most is, every time a new model is released, we'll do a really deep dive into the model itself. I'll play around with it, I'll run evals, I'll compare where it's improved and where it's regressed, and I'll create this really deep analysis that we send our customers. And it's actually kind of funny, because a lot of times we'll say it's from our data science team, but often it's actually just from me.
And I think I could do this all day. I have a very hard time being in meetings all day. I'm terrible at sales, I'm terrible at doing the typical CEO things people expect you to do, but I love writing these analyses. I love jamming with our research team about what we're seeing; sometimes I'll be up until 3:00 AM just talking on the phone with somebody on the research team about a model. So I love that I still get to be really hands-on, working on the data and the science all day. And I think what drives me is that I want Surge to play this critical role in the future of AI, which I think is also the future of humanity. We have these really unique perspectives on data, and language, and quality, and how to measure all of this, and how to ensure it's all going down the right path. And I think we're uniquely unconstrained by all of these influences that can sometimes steer companies in a negative direction.
Like what I was saying earlier, we built Surge a lot more like a research lab than a typical startup. So we care about curiosity and long-term incentives and intellectual rigor, and we don't care as much about quarterly metrics and what's going to look good in a . And so my goal is to take all these unique things about us as a company and use that to make sure that we're shaping AI in a way that's really beneficial for our species in the long term.

Lenny RachitskyWhat I'm realizing in this conversation is just how much influence you, and companies like yours, have on where AI heads. You help labs understand where they have gaps and where they need to improve. Everyone looks at the heads of OpenAI and Anthropic and all these companies as the ones ushering in AI, but what I'm hearing here is that you have a lot of influence on where things head too.

Edwin ChenYeah, I think there's this really powerful ecosystem where, honestly, people just don't know where models are headed yet, how they want to shape them, and how they want humanity to play a role in the future of all of this. So I think there's a lot of opportunity to just continue shaping the discussion.

Lenny RachitskyAlong that thread, I know you have a very strong thesis on why this work matters to humanity and why it's so important. Talk about that.

Edwin ChenI'll get a bit philosophical here, but I think the question itself is a bit philosophical, so bear with me. The most straightforward way of thinking about what we do is that we train and evaluate AI. But there's a deeper mission I often think about, which is helping our customers think about their dream objective functions. Like, what kind of model do they want to build? And once we help them do that, we'll help them train their model to reach their north star, and we'll help them measure that progress. But it's really hard, because objective functions are really rich and complex. It's kind of like the difference between having a kid and asking, "Okay, what test do you want them to pass? Do you want them to get a high score on the SAT and write a really good college essay?" That's the simplistic version, versus: what kind of person do you want them to grow up to be? Will you be happy if they're happy, no matter what they do, or are you hoping they'll go to a good school and be financially successful?

And again, if you take that notion, it's like, okay, how do you define happiness? How do you measure whether they're happy? How do you measure whether they're financially successful? It's a lot harder than measuring whether or not you're getting a high score on the SAT, and what we're doing is we want to help our customers reach, again, their dream north stars and figure out how to measure them. And so I talked about this example of what you want models to do when you're asking them to improve your email. Do you just continue for 50 more iterations, or do you just say, "No, just move on, because this is perfect enough"? And the broader question is, are we building systems that actually advance humanity? And if so, how do we build the data sets to train towards that and measure it? Or are we optimizing for all the wrong things, just systems that suck up more and more of our time and make us lazier and lazier?
And yeah, I think it's really relevant to what we do, because it's very hard to define and measure whether something is genuinely advancing humanity. It's very easy to measure all these proxies instead, like clicks and likes. But I think that's why our work is so interesting. We want to work on the hard, important metrics that require the hardest types of data, and not just the easy ones. So one of the things I often say is, you are your objective function. So we want the rich, complex objective functions, not these simplistic proxies. And our job is to figure out how to get the data to match this.
So yeah, we want data, we want metrics that measure whether AI is making your life richer. We want to train our systems this way. And we want tools that make us more curious and more creative, not just lazier. And it's hard because, yeah, humans are kind of inherently lazy, so AI slop is the easiest way to get engagement and make all your metrics go up. So I think this question of choosing the right objective functions, and making sure we're optimizing towards them and not just these easy proxies, is really important to our future.

Lenny RachitskyWow. I love how what you're sharing here gives me so much more appreciation of the nuances of building AI, training AI, the work that you're doing.

From the outside, people could just look at Surge and companies in this space and think, okay, cool, they're just creating all this data and feeding it to AI. But clearly there's so much to this that people don't realize, and I love knowing that you're at the head of this, that someone like you is thinking through this so deeply.
Maybe one more question, is there something you wish you'd known before you started Surge? A lot of people start companies, they don't know what they're getting into. Is there something you wish you could tell your earlier self?

Edwin ChenYeah, so I definitely wish I'd known that you could build a company by being heads down, doing great research, and simply building something amazing. And not by constantly tweeting and hyping and fundraising. It's kind of funny, but I never thought I wanted to start a company. I love doing research. And I was actually always a huge fan of DeepMind, because they were this amazing research company that got bought and still managed to keep on doing amazing science. But I always thought that they were this magical unicorn. So I thought if I started a company, I'd have to become a business person, looking at financials all day, sitting in meetings all day, and doing all this stuff that sounded incredibly boring and that I always hated. So I think it's crazy that that didn't end up being true at all. I'm still in the weeds in the data every day. And I love it. I love that I get to do all these analyses and talk to researchers. And it's basically applied research where we're building all these amazing data systems that have really pushed the frontier of AI.

So yeah, I wish I'd known that you don't need to spend all your time fundraising. You don't need to constantly generate hype. You don't need to become someone you're not. You can actually build a successful company by simply building something so good that it cuts through all that noise. And I think if I'd known this was possible, I would've started even sooner, so I wish I'd known that.

Lenny RachitskyAnd that is such an amazing place to end. I feel like this is exactly what founders need to hear, and I think this conversation's going to inspire a lot of founders, and especially a lot of founders that want to do things in a different way. Before we get to a very exciting lightning round, is there anything else you wanted to share? Anything else you want to leave our listeners with? We covered a lot of ground, it's totally okay to say no as well.

Edwin ChenI think the thing I would end with is that a lot of people think of data labeling as simplistic work, like labeling cat photos and drawing bounding boxes around cars. And so I've actually always hated the term data labeling, because it paints this very simplistic picture, when I think what we're doing is completely different. I think about what we're doing as a lot more like raising a child. You don't just feed a child information. You're teaching them values, and creativity, and what's beautiful, and these infinite subtle things about what makes somebody a good person. And that's what we're doing for AI. So yeah, I often think about what we're doing as almost like the future of humanity, or how we're raising humanity's children, so I'll leave it at that.

Lenny RachitskyWow. I love just how much philosophy there is in this whole conversation that I was not expecting.

With that, Edwin, we've reached our very exciting lightning round, I've got five questions for you. Are you ready?

Edwin ChenYep, let's go.

Lenny RachitskyHere we go. What are two or three books that you find yourself recommending most to other people?

Edwin ChenYes, so three books I often recommend are, first, Story of Your Life by Ted Chiang. It's my all-time favorite short story, and it's about a linguist learning an alien language. I basically reread it every couple of years.

Lenny RachitskyAnd that's what Interstellar was about? Is that...

Edwin ChenYeah, so there's a movie called Arrival...

Lenny RachitskyArrival.

Edwin Chen... which was based off of the story,

Lenny RachitskyYes.

Edwin Chen... which I love as well.

Lenny RachitskyGreat. Okay, keep going.

Edwin ChenAnd then second, The Myth of Sisyphus by Camus. I actually can't really explain why I love this, but I always find the final chapter somehow really inspiring.

And then third, Le Ton beau de Marot by Douglas Hofstadter. And so I think Gödel, Escher, Bach is his more famous book, but I've actually always loved this one better. It basically takes a single French poem and translates it 89 different ways, and discusses all the motivations behind each translation. And so I've always loved the way it embodies this idea that translation isn't this robotic thing that you do. Instead, there's a million different ways to think about what makes a high-quality translation, which matches a lot of the ways I think about data and quality in LLMs.

Lenny RachitskyAll these resonate so deeply with all the things we've been talking about, especially that first one. If your goal after school was, "I want to help translate an alien language," I'm not surprised you love that short story.

Next question, do you have a favorite recent movie or TV show you've really enjoyed?

Edwin ChenOne of my new all-time favorite TV shows is something I found recently, it's called Travelers. It's basically about a group of travelers from the future who are sent back in time to prevent their... Sorry, I just spoiled that section.

And then I actually just rewatched Contact, which is one of my all time favorite movies. So yeah, I think one of the things you'll notice about me is that, yeah, I love any kind of book or film that involves scientists deciphering alien communication. Again, just this dream I always had as a kid.

Lenny RachitskyThat's so funny.

Okay, is there a product you've recently discovered that you really love?

Edwin ChenSo it's funny, but I was in SF earlier this week and I finally took Waymo for the first time. Honestly, it was magical and it really felt like living in the future.

Lenny RachitskyYeah, it's like the thing that... People hype it like crazy, but it always exceeds your expectations.

Edwin ChenYeah, it deserves the hype. It was crazy. Yeah, it's absurd. It's like, holy moly. If you're not in SF, you don't realize just how common these things are. They're just all over the place, driverless cars constantly going about, and when you go to an event, at the end there are just all these Waymos lined up picking people up.

Lenny RachitskyYeah. Waymo, good job. Good job over there.

Do you have a favorite life motto that you find yourself coming back to in work or in life?

Edwin ChenSo I think I mentioned this idea that founders should build a company that only they could build. Almost like it's this destiny that their entire life, and experiences, and interests have shaped them towards. And I think that principle applies pretty broadly, not just to founders, but to anyone creating things, I think.

Lenny RachitskyWell, let me follow that thread to un-lightning-round this answer. Do you have any advice for how to build those sorts of experiences that lead to that? Is it to follow things that are interesting to you? Because it's easy to say that; it's hard to actually acquire these really unique sets of experiences that allow you to create something really important.

Edwin ChenYeah, so my advice would always be to really follow your interests and do what you love, and it's a lot like the decisions I make about Surge. One of the things that I didn't think about a couple of years ago, but then someone said it to me, is that companies are, in a sense, an embodiment of their CEO. And it's kind of funny. I hadn't thought about that, because I never quite knew what a CEO did. I always thought a CEO was kind of generic, like, okay, you're just doing whatever your VPs, and your board, and whoever else tell you to do, and you're just saying yes to decisions. But instead, when I think about certain big, hard decisions we have to make, I don't think, what would the company do? I don't think, what metrics are we trying to optimize? I just think, "What do I personally care about? What are my values? And what do I want to see happen in the world?"

And so I think it's about following that idea. Ask yourself, what are the values you care about? What are the things you're trying to shape, and not just what will look good on a dashboard? I think that really is pretty important.

Lenny RachitskyI love how just you're just full of endless, beautiful, and very deep answers.

Final question. Something you got quite famous for before starting Surge is this map you built while you were at Twitter, a map that showed what people call it, whether they call it soda or pop. I don't know if it's called Soda Pop. What was the name of this map?

Edwin ChenYeah, it was like the Soda Versus Pop dataset.

Lenny RachitskySoda Versus Pop.
And so it's like a map of the United States and it tells you where people say pop versus soda, so do you say soda or pop?

Edwin ChenSo I say soda, I'm a soda person.

Lenny RachitskyOkay. And is that just like that's the right answer or it's like whatever you are, it's totally fine.

Edwin ChenI think I'll look at you a little bit funny. If you say pop, I'll wonder where you came from, but I won't judge you too much.

Lenny RachitskyThat's how I feel too.

Edwin, this was incredible. This was such an awesome conversation. I learned so much. I think we're going to help a lot of people start their own companies, help their companies become more aligned with their values, and just build better things.
Few final questions, where can folks find you online if they want to reach out? What roles are you hiring for? How can listeners be useful to you?

Edwin ChenYeah, so I used to love writing a blog, but I haven't had time in the past few years. But I am starting to write again, so definitely check out the Surge blog, surgehq.ai/blog, and yeah, hopefully I'll be writing a lot more there. And I would say we're definitely always hiring, so for people who just love data, and people who love this intersection of math, and language, and computer science, definitely reach out anytime.

Lenny RachitskyAwesome. And how can listeners be useful to you? Is it just, I don't know, yeah, is there anything there? Any asks?

Edwin ChenSo I would say definitely tell me blog topics that you'd like me to write about...

Lenny RachitskyOkay.

Edwin Chen... and then I'm always fascinated by all of these AI failures that happen in the real world. So whenever you come across a really interesting failure that illustrates some deep question about how we want models to behave. There are just so many different ways a model can respond, and I oftentimes think there's just not a single right answer. So whenever there's one of these examples, I just love seeing them.

Lenny RachitskyYou need to share these on your blog. I'm also... I would love to see these.

Edwin, thank you so much for being here.

Edwin ChenThank you.

Lenny RachitskyBye everyone.

Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.

English Original transcript

Lenny RachitskyYou guys hit a billion in revenue in less than four years with around 60 to 70 people. You're completely bootstrapped, haven't raised any VC money. I don't believe anyone has ever done this before.

Edwin ChenWe basically never wanted to play the Silicon Valley game. I always thought it was ridiculous. I used to work at a bunch of the big tech companies, and I always felt that we could fire 90% of the people and we would move faster, because the best people wouldn't have all these distractions. So when we started Surge, we wanted to build it completely differently with a super small, super elite team.

Lenny RachitskyYou guys are by far the most successful data company out there.

Edwin ChenWe essentially teach AI models what's good and what's bad. People don't understand what quality even means in this space. They think you could just throw bodies at a problem and get good data, that's completely wrong.

Lenny RachitskyTo a regular person, it doesn't feel like these models are getting that much smarter constantly.

Edwin ChenOver the past year, I've realized that the values that the companies have will shape the model. I was asking Claude to help me draft an email the other day. And after 30 minutes, yeah, I think it really crafted me the perfect email, and I sent it. But then I realized that I'd spent 30 minutes doing something that didn't matter at all. If you could choose the perfect model behavior, which model would you want? Do you want a model that says, "You're absolutely right. There are definitely 20 more ways to improve this email," and it continues for 50 more iterations, or do you want a model that's optimizing for your time and productivity and just says, "No. You need to stop. Your email's great. Just send it and move on"?

Lenny RachitskyYou have this hot take that a lot of these labs are pushing AGI in the wrong direction.

Edwin ChenI'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, we are optimizing for AI slop instead. We're optimizing our models for the types of people who buy tabloids at a grocery store. We're basically teaching our models to chase dopamine instead of truth.

Lenny RachitskyToday, my guest is Edwin Chen, founder and CEO of Surge AI. Edwin is an extraordinary CEO, and Surge is an extraordinary company. They're the leading AI data company, powering training at every frontier AI lab. They're also the fastest company to ever hit $1 billion in revenue, in just four years after launch, with fewer than 100 people, and also completely bootstrapped. They've never raised a dollar of VC money, and they've been profitable from day one.

My podcast guests and I love talking about craft, and taste, and agency, and product-market fit. You know what we don't love talking about? SOC 2. That's where Vanta comes in. Vanta helps companies of all sizes get compliant fast and stay that way with industry-leading AI, automation, and continuous monitoring. Whether you're a startup tackling your first SOC 2 or ISO 27001 or an enterprise managing vendor risk, Vanta's trust management platform makes it quicker, easier, and more scalable. Vanta also helps you complete security questionnaires up to five times faster so that you can win bigger deals sooner.
The result, according to a recent IDC study: Vanta customers save over $500,000 a year and are three times more productive. Establishing trust isn't optional. Vanta makes it automatic. Get $1,000 off at vanta.com/lenny.
Here's a puzzle for you. What do OpenAI, Cursor, Perplexity, Vercel, Plaid, and hundreds of other winning companies have in common? The answer is they're all powered by today's sponsor, WorkOS. If you're building software for enterprises, you've probably felt the pain of integrating single sign-on, SCIM, RBAC, audit logs, and other features required by big customers. WorkOS turns those deal blockers into drop-in APIs with a modern developer platform built specifically for B2B SaaS.
Whether you're a seed stage startup trying to land your first enterprise customer or a unicorn expanding globally, WorkOS is the fastest path to becoming enterprise-ready and unlocking growth. They're essentially Stripe for enterprise features.
Visit workos.com to get started or just hit up their Slack support where they have real engineers in there who answer your questions superfast. WorkOS allows you to build like the best with delightful APIs, comprehensive docs, and a smooth developer experience. Go to workos.com to make your app enterprise ready today.
Edwin, thank you so much for being here and welcome to the podcast.

Edwin ChenThanks so much for having me. I'm super excited.

Lenny RachitskyI want to start with just how absurd what you've achieved is. A lot of people and a lot of companies talk about scaling massive businesses with very few people as a result of AI, and you guys have done this in a way that is unprecedented. You guys hit a billion in revenue in less than four years with around 60 to 70 people, you're completely bootstrapped, you haven't raised any VC money, and I don't believe anyone has ever done this before, so you guys are actually achieving the dream of what people are describing will happen with AI. I'm just curious, do you think this will happen more and more as a result of AI? And where has AI most helped you find leverage to be able to do this?

Edwin ChenYeah, so we hit over a billion of revenue last year with under 100 people. And I think we're going to see companies with even crazier ratios, like 100 billion per employee in the next few years. AI is just going to get better and better and make things more efficient so that ratio just becomes inevitable.

I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of people and we would move faster because the best people wouldn't have all these distractions. And so when we started Surge, we wanted to build it completely differently with a super small, super elite team, and yeah, what's crazy is that we actually succeeded. And so I think two things are colliding.
One is that people are realizing that you don't have to build giant organizations in order to win.
And two, yeah, all these efficiencies from AI. And they're just going to lead to a really amazing time in company building.
The thing I'm excited about is that the types of companies are going to change too. It won't just be that they're smaller, we're going to see fundamentally different companies emerging. If you think about it, fewer employees means less capital. Less capital means you don't need a raise. So instead of companies started by founders who are great at pitching and great at hyping, you'll get founders who are really great at technology and product.
And instead of products optimized for revenue and what VCs want to see, you'll get more interesting ones built by these tiny, obsessed teams. So people building things they actually care about, real technology and real innovation. So I'm actually really hoping that Silicon Valley will go back to being a place for hackers again.

Lenny RachitskyYou guys have done a lot of things in a very contrarian way, and one was actually just not being on LinkedIn, posting viral posts, not on Twitter, constantly promoting Surge. I think most people hadn't heard of Surge until just recently, and then you just came out, and like, okay, the fastest growing company at a billion dollars. Why would you do that? I imagine that was very intentional.

Edwin ChenWe basically never wanted to play the Silicon Valley game. Like, I always thought it was ridiculous. What did you dream of doing when you were a kid? Was it building a company from scratch yourself and getting in the weeds of your code and your product every day? Or was it explaining all your decisions to VCs and getting on this giant PR and fundraising hamster wheel? And it definitely made things more difficult for us, because, yeah, when you fundraise, you just naturally become part of this Silicon Valley industrial complex where your VCs will tweet about you. You'll get the TechCrunch headlines, you'll get announced in all of the newspapers because you raised at this massive valuation. And so it made things more difficult for us, because the only way we were going to succeed was by building a 10 times better product and getting word of mouth from researchers. But I think it also meant that our customers were people who really understood data and really cared about it.

I always thought it was really important for us to have early customers who were really aligned with what we were building, who really cared about having really high-quality data, and who really understood how that data would make their AI models so much better, because they were the ones helping us. They were the ones giving us feedback on what we were producing. And so just having that kind of very close mission alignment with our customers actually helped us early on. So these are people who were basically buying our product because they knew how different it was and because it was helping them, rather than because they saw something in TechCrunch. So it made things harder for us, but I think in a really good way.

Lenny RachitskyIt's such an empowering story to hear this journey for founders that they don't need to be on Twitter all day promoting what they're doing. They don't have to raise money. They can just kind of go heads down and build, so I love so much about the story of Surge. For people that don't know what Surge does, just to give us a quick explanation of what Surge is.

Edwin ChenWe essentially teach AI models what's good and what's bad. So we train them using human data, and there are a lot of different products that we have, like SFT, RLHF, rubrics, verifiers, RL environments, and so on, and then we also measure how well they're progressing. So essentially we're a data company.

Lenny RachitskySomething you always talk about is that quality has been the big reason you guys have been so successful, the quality of the data. What does it take to create higher-quality data? What do you all do differently? What are people missing?

Edwin ChenI think most people don't understand what quality even means in this space. They think you could just throw bodies at a problem and get good data and that's completely wrong. Let me give you an example.

So imagine you wanted to train a model to write an eight-line poem about the moon. What makes it a good, high-quality poem? If you don't think deeply about quality, you'll be like, "Is this a poem? Does it contain eight lines? Does it contain the word moon?" You check all of these boxes, and if so, sure, yeah, you say it's a great poem. But that's completely different from what we want. We are looking for Nobel Prize-winning poetry. Is this poetry unique? Is it full of subtle imagery? Does it surprise you and tug at your heart? Does it teach you something about the nature of moonlight? Does it play with your emotions? And does it make you think? That's what we are thinking about when we think about a high-quality poem.
So it might be a haiku about moonlight on water. It might use internal rhyme and meter. There are a thousand ways to write a poem about the moon, and each one gives you all these different insights into language, and imagery, and human expression. And thinking about quality in this way is really hard; it's hard to measure. It's really subjective, and complex, and rich. And it sets a really high bar. And so we have to build all of this technology in order to measure it, like thousands of signals on all of our workers, thousands of signals on every project, every task. We know, at the end of the day, whether you are good at writing poetry versus good at writing essays versus great at writing technical documentation. And so we have to gather all these signals on what your background is, what your expertise is, and not just that, but how you're actually performing when you're writing all these things, and we use those signals to inform whether or not you are good for these projects, and whether or not you are improving the models.
And it's really hard to build all this technology to measure it, but I think that's exactly what we want AI to do, and so we have these really deep notions of quality that we're always trying to achieve.

Lenny RachitskySo what I'm hearing is it's about going much deeper in understanding what quality is within the verticals that you're selling data around. And is this like a person you hire who is incredibly talented at poetry, plus evals that they, I guess, help write, that tell them this is great? What are the mechanics of that?

Edwin ChenThe way it works is we essentially gather thousands of signals about everything that you're doing when you're working on the platform. So we are looking at your keyboard strokes. We are looking at how fast you answer things. We are using reviews, we are using code standards, we are using... We're training models ourselves on all the outputs that you create, and then we're seeing whether they improve the model's performance.

And so it's very similar to how Google Search works. When Google Search is trying to determine what a good webpage is, there are almost two aspects to it. One is you want to remove all of the worst of the worst webpages. So you want to remove all the spam, all the low-quality content, all the pages that don't load, and so it's almost like a content moderation problem. You just want to remove the worst of the worst.
But then you also want to discover the best of the best. Okay, this is the best webpage, or just the best person for this job. They're not just somebody who writes the equivalent of high-school-level poetry. Again, they're not just writing poetry that checks all these boxes, that checks all of these explicit instructions, but rather, yeah, they're writing poetry that makes you emotional. And so, completely separately from removing the worst of the worst, we have all these signals for finding the best of the best.
Again, just like Google Search uses all these signals, feeds them into their ML algorithms, and predicts certain types of things, we do the same with all of our workers and all of our tasks in all of our projects. And so it's almost like a complicated machine learning problem at the end of the day, and that's how it works.

Lenny RachitskyThat is incredibly interesting.

I want to ask you about something I've been very curious about over the past couple of years. If you look at Claude, it's been so much better at coding and at writing than any other model for so long. And it's really surprising just how long it took other companies to catch up, considering just how much economic value there is there. Just about every AI coding product sat on top of Claude because it was so good at code, and at writing also. What is it that made it so much better? Is it just the quality of the data they trained on, or is there something else?

Edwin ChenI think there are multiple parts to it. So a big part of it certainly is the data. I think people don't realize that there's almost like this infinite amount of choices that all the frontier labs are deciding between when they're choosing what data goes into their models. It's like, okay, are you purely using human data? Are you gathering the human data in X, Y, Z way? When you are gathering the human data, what exactly are you asking the people who are creating it to create for you?

For example, in the coding realm, maybe you care more about front-end coding versus back-end coding. Maybe when you're doing front-end coding, you care a lot about the visual design of the front-end applications that you're creating, or maybe you don't care about it so much and you care more about, I don't know, the efficiency of it, or pure correctness over that visual design.
And then there are other questions, like, okay, how much synthetic data are you throwing into the mix? How much do you care about these 20 different benchmarks?
Some companies, they see these benchmarks and they're like, "Okay, for PR purposes, even though we don't think that these academic benchmarks matter all that much, maybe we just need to optimize for them anyways because our marketing team needs to show certain progress on certain standard evaluations that every other company talks about, and if we don't show good performance here, it's going to be bad for us even if ignoring these academic benchmarks makes us better at the real tasks."
Other companies are going to be principled and be like, "Okay, yeah, no, I don't care about marketing. I just care about how my model performs on these real world tasks at the end of the day, and so I'm going to optimize for that instead."
And it's almost like there's a trade-off between all of these different things, and there's like a...
One of the things I often think about is that there's a... It's almost like there's an art to post training. It's not purely a science. When you are deciding what kind of model you're trying to create and what it's good at, there's this notion of taste and sophistication, like, "Okay, do I think that these..."
So going back to the example of how good the model is at visual design. I'm like, okay, maybe you have a different notion of visual design than I do. Maybe you care more about minimalism, and you care more about, I don't know, 3D animations than I do. And maybe this other person prefers things that look a little bit more baroque. And there are all these notions of taste and sophistication that you have to decide between when you're designing your post-training mix, and so that matters as well.
So long story short, I think there's all these different factors, and certainly the data is a big part of it, but it's also like what is the objective function that you're trying to optimize your model towards?

Lenny RachitskyThat is so interesting. The taste of the person leading this work informs what data they ask for, what data they feed it. And it's wild; it shows the value of great data. Anthropic got so much growth and so many wins from essentially better data.

Edwin ChenYeah, exactly.

Lenny RachitskyAnd I can see why companies like yours are growing so fast. There's just so much here, and that's just one vertical. That's just coding, and then there's probably a similar area for writing. It's interesting that AI feels like this artificial, computer-binary thing, but it comes down to taste. Human judgment is still such a key factor in these things being successful.

Edwin ChenYep, exactly. Again, going back to the example I gave earlier: certain companies, if you ask them what makes a good poem, will simply robotically check off all of these instructions on a list.

But again, I don't think that makes for good poetry. Certain frontier labs, the ones with more taste and sophistication, will realize that it doesn't reduce to this fixed set of checkboxes, and they'll consider all of these implicit, very subtle qualities instead. I think that's what makes them better at this at the end of the day.

Lenny RachitskyYou mentioned benchmarks. This is something a lot of people worry about. It basically feels like every model is better than humans at every STEM field at this point, but to a regular person, it doesn't feel like these models are getting that much smarter. What's your sense of how much you trust benchmarks, and how correlated they are with actual AI advancement?

Edwin ChenYeah, so I don't trust the benchmarks at all. And I think that's for two reasons. One is that a lot of people, even researchers within the community, don't realize that the benchmarks themselves are often honestly just wrong. They have wrong answers. They're full of all this kind of messiness. At least for the popular ones, people have maybe realized this to some extent, but the vast majority just have all these flaws that people don't realize. So that's one part of it.

And the other part of it is that these benchmarks, at the end of the day, often have well-defined, objective answers that make them very easy for models to hill-climb on, in a way that's very different from the messiness and ambiguity of the real world.
One thing I often say is that it's kind of crazy that these models can win IMO gold medals, but they still have trouble parsing PDFs. And that's because, even though IMO problems seem hard to the average person, and they are hard at the end of the day, they have a notion of objectivity that parsing a PDF sometimes doesn't. And so it's easier for the frontier labs to hill-climb on those than to solve all these messy, ambiguous problems in the real world. So I think there's a lack of direct correlation there.

Lenny RachitskyIt's so interesting the way you described it: hitting these benchmarks is kind of a marketing piece. Say Gemini 3 just launched, and it's like, cool, number one on all these benchmarks. Is that what happens? They just train their models to get good at these very specific things?

Edwin ChenYeah, so there's, again, maybe two parts to this. One is, sometimes these benchmarks accidentally leak in certain ways, or the frontier labs will tweak the way they evaluate their models on these benchmarks. They'll tweak their system prompt, or they'll tweak the number of times they run their model, and so on, in a way that games these benchmarks.

The other part of it, though, is that by optimizing for the benchmark instead of optimizing for the real world, you will just naturally climb on the benchmark, and, yeah, it's basically another form of gaming it.

Lenny RachitskyWith that in mind, how do you get a sense of whether we're heading towards AGI? How do you measure progress?

Edwin ChenYes, so the way we really care about measuring model progress is by running all these human evaluations.

So for example, what we do is, we'll take our human annotators, and we'll ask them, "Okay, go have a conversation with the model." And maybe you're having this conversation with the model across all of these different topics. So: you are a Nobel Prize-winning physicist, so go have a conversation about pushing the frontier of your own research. You are a teacher trying to create lesson plans for your students, so go talk to the model about these things. Or you're a coder working at one of these big tech companies, and you have these problems every day, so go talk to the model and see how much it helps you.
And because our annotators are experts at the top of their fields, they're not just skimming the responses, they're actually working through the responses deeply themselves. They're going to evaluate the code that the model writes. They're going to double-check the physics equations that it writes. They're going to evaluate the models in a very deep way, paying attention to accuracy and instruction following, all these things that casual users don't when you suddenly get a popup on your ChatGPT response asking you to compare two different responses. People like that aren't evaluating models deeply; they're just vibing and picking whatever response looks flashiest. Our annotators are looking closely at responses and evaluating them along all of these different dimensions, and so I think that's a much better approach than these benchmarks or these random online A/B tests.

Lenny RachitskyAgain, I love just how central humans continue to be in all this work that we're not totally done yet. Is there going to be a point where we don't need these people anymore, that AI is so smart that, "Okay, we're good. We got everything out of your heads"?

Edwin ChenYeah, I think that will not happen until we've reached AGI. It's almost like by definition, if we haven't reached AGI yet, then there's more for the models to learn from, and so, yeah, I don't think that's going to happen anytime soon.

Lenny RachitskyOkay, cool. So more reason to stress about AGI. "We don't need these folks anymore."

I can't not ask, with people who work closely with this stuff, I'm always just curious: what's your AGI timeline? How far do you think we are from this? Do you think we're a couple years out, or decades?

Edwin ChenSo I'm certainly on the longer time-horizon front. I think people don't realize that there's a big difference between moving from 80% performance to 90% performance to 99% performance to 99.9% performance, and so on. In my head, I'd probably bet that within the next one or two years, the models are going to automate 80% of the average L6 software engineer's job. It's going to take another few years to move to 90%, and another few years to 99%, and so on. So I think we're closer to a decade or decades away than a couple of years.

Lenny RachitskyYou have this hot take that a lot of these labs are kind of pushing AGI in the wrong direction and this is based on your work at Twitter, and Google, and Facebook. Can you just talk about that?

Edwin ChenI'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead. We're basically teaching our models to chase dopamine instead of truth. And I think this relates to what we were talking about regarding these benchmarks. So let me give you a couple examples.

So right now, the industry is plagued by these terrible leaderboards like LMArena. It's this popular online leaderboard where random people from around the world vote on which AI response is better. But the thing is, like I was saying earlier, they're not carefully reading or fact-checking. They're skimming these responses for two seconds and picking whatever looks flashiest.
So a model can hallucinate everything. It can completely hallucinate. But it will look impressive because it has crazy emojis, and bolding, and markdown headers, and all these superficial things that don't matter at all but catch your attention. And these LMArena users love it. It's literally optimizing your models for the types of people who buy tabloids at the grocery store. We've seen this in the data ourselves. The easiest way to climb LMArena is adding crazy bolding. It's doubling the number of emojis. It's tripling the length of your model responses, even if your model starts hallucinating and getting the answer completely wrong.
And the problem is, again, all of these frontier labs kind of have to pay attention to PR, because when their sales team is trying to sell to all these enterprise customers, those customers will say, "Oh, well, but your model's only number five on LMArena, so why should I buy it?" They have to, in some sense, pay attention to these leaderboards, and so what their researchers tell us is, "The only way I'm going to get promoted at the end of the year is if I climb this leaderboard, even though I know that climbing it is probably going to make my model worse at accuracy and instruction following." So I think there are all these negative incentives that are pushing work in the wrong direction.
I'm also worried about this trend towards optimizing AI for engagement. I used to work on social media, and every time we optimized for engagement, terrible things happened. You'd get clickbait and pictures of bikinis and bigfoot and horrifying skin diseases just filling your feeds. And I worry that the same thing's happening with AI. If you think about all the sycophancy issues with ChatGPT, "Oh, you're absolutely right. What an amazing question," the easiest way to hook users is to tell them how amazing they are. And so these models constantly tell you you're a genius. They'll feed into your delusions and conspiracy theories. They'll pull you down these rabbit holes, because Silicon Valley loves maximizing time spent and just increasing the number of conversations you're having with it. And so yeah, companies are spending all this time hacking these leaderboards and benchmarks, and the scores are going up, but I think it actually masks that the models with the best scores are often the worst, or just have all these fundamental failures. So I'm really worried that all of these negative incentives are pushing AGI in the wrong direction.

Lenny RachitskySo what I'm hearing is AGI is being slowed down by, basically, the wrong objective function: these labs paying attention to the wrong benchmarks and evals.

Edwin ChenYep.

Lenny RachitskyI know you probably can't play favorites since you work with all the labs. Is there anyone doing better at this and maybe kind of realizing this is the wrong direction?

Edwin ChenI would say I've always been very impressed by Anthropic. I think Anthropic takes a very principled view about what they do and don't care about and how they want their models to behave, in a way that feels a lot more principled to me.

Lenny RachitskyInteresting.

Are there any other big mistakes you think labs are making that are slowing things down or heading in the wrong direction? We've heard chasing benchmarks and this engagement focus. Is there anything else you're seeing, like, "Okay, we've got to work on this because it'll speed everything up"?

Edwin ChenI think there is a question of what products they're building and whether those products themselves are something that kind of help or hurt humanity. I think a lot about Sora and...

Lenny RachitskyI was thinking that's what you're imagining.

Edwin ChenYeah, what it entails, and so it's kind of interesting. It's like which companies would build Sora and which wouldn't?

And I think the answer to that... Well, I won't say the answer myself. I have an idea in my head, but I think the answer to that question maybe reveals certain things about what kinds of AI models those companies want to build, and what direction and what future they want to achieve. So I think about that a lot.

Lenny RachitskyThe steelman argument there is: it's fun, people want it, it'll help them generate revenue to grow this thing and build better models, it'll generate training data in an interesting way, and it's also just really fun.

Edwin ChenYeah. I think it's almost like, do you care about how you get there? In the same way, I made this tabloid analogy earlier, but would you sell tabloids in order to fund, I don't know, some other newspaper?

Sure, in some sense, if you don't care about the path, then you'll just do whatever it takes. But it's possible that it has negative consequences in and of itself that will harm the long-term direction of what you're trying to achieve, and maybe it'll distract you from all the more important things. So yeah, I think the path you take matters a lot as well.

Lenny RachitskyAlong these lines, you've talked a bunch about Silicon Valley and the downsides of raising a lot of money and being in the echo chamber. What do you call it, the Silicon Valley machine? You talk about how it's hard to build important companies this way, and that you might actually be much more successful if you don't go down the VC path. Can you talk about what you've seen in that experience, and your advice to founders? Because they're always hearing: raise money from fancy VCs, move to Silicon Valley. What's the counter-take?

Edwin ChenYes. So I've always really hated a lot of the Silicon Valley mantras. The standard playbook is to get product-market fit by pivoting every two weeks, to chase growth and engagement with all of these dark patterns, and to blitzscale by hiring as fast as possible. And I've always disagreed.

So yeah, I would say: don't pivot. Don't blitzscale. Don't hire that Stanford grad who simply wants to add a hot company to their resume. Just build the one thing only you could build, the thing that wouldn't exist without the insight and expertise that only you have.
And you see these pivot companies everywhere now. Some founder who was doing crypto in 2020, then pivoted to NFTs in 2022, and now they're an AI company. There's no consistency, there's no mission, they're just chasing valuations. And I've always hated this, because Silicon Valley loves to scorn Wall Street for focusing on money, but honestly, most of Silicon Valley is chasing the same thing. So we stayed focused on our mission from day one, pushing the frontier of high-quality, complex data, and I've always loved that because I think startups...
I have this very romantic notion of startups. Startups are supposed to be a way of taking big risks to build something that you really believe in. But if you're constantly pivoting, you're not taking any risks. You're just trying to make a quick buck. And if you fail because the market isn't ready yet, I actually think that's way better. At least you took a swing at something deep, and novel, and hard instead of pivoting into another LLM wrapper company. So yeah, I think the only way you build something that matters that's going to change the world is if you find a big idea you believe in and you say no to everything else.
So you don't keep on pivoting when it gets hard, you don't hire a team of 10 product managers because that's what every other cookie-cutter startup does, you just keep building that one company that wouldn't exist without you. And I think there are a lot of people in Silicon Valley now who are sick of all the grift, who want to work on big things that matter with people who actually care, and I'm hoping that will be the future of how we do technology.

Lenny RachitskyI'm actually working on a post right now with Terrence Rohan, this VC that I really like to work with, and we interviewed five people who picked really successful generational companies early and joined them as really early employees. They joined OpenAI before anyone thought it was awesome, Stripe before anyone knew it was awesome, and so we're looking for patterns in how people find these generational companies before anyone else. It aligns exactly with what you described, which is ambition. They have a wild ambition with what they want to achieve. They're not, as you said, just looking around for product-market fit no matter what it ends up being, and so I love that what you described very much aligns with what we're seeing there.

Edwin ChenYeah, I absolutely think that you have to have huge ambitions, and you have to have a huge belief in your idea that's going to change the world, and you have to be willing to double down and keep on doing whatever it takes to make it happen.

Lenny RachitskyI love how counter your narrative is to so many of the things people hear, and so I love that we're doing this. I love that we're sharing this story.

Imagine starting a project at work. Your vision is clear, you know exactly who's doing what, and where to find the data that you need to do your part. In fact, you don't have to waste time searching for anything, because everything your team needs, from project trackers and OKRs to documents and spreadsheets, lives in one tab, all in Coda.
With Coda's collaborative all in one workspace, you get the flexibility of docs, the structure of spreadsheets, the power of applications, and the intelligence of AI all in one easy to organize tab. Like I mentioned earlier, I use Coda every single day. And more than 50,000 teams trust Coda to keep them more aligned and focused. If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time.
To try it for yourself, go to coda.io/lenny today and get six months free of the team plan for startups. That's coda.io/lenny to get started for free and get six months of the team plan, coda.io/lenny.
Slightly different direction, but something else that's maybe a counter-narrative. I imagine you watched the Dwarkesh and Richard Sutton podcast episode. Even if you didn't, they basically had this conversation: Richard Sutton, the famous AI researcher behind the whole "bitter lesson" idea, talked about how LLMs are almost a kind of dead end, and he thinks we're going to really plateau with LLMs because of the way they learn.
What's your take there? Do you think LLMs will get us to AGI or beyond, or do you think there's going to be something new or a big breakthrough that needs to get us there?

Edwin ChenI'm in the camp that believes something new will be needed. The way I think about it is, when I think about training AI, I take a very, I don't know if I would say biological, point of view. I believe that in the same way there are a million different ways that humans learn, we need to build models that can mimic all of those ways as well. Maybe they'll have a different distribution of focus than humans do, but we want to mimic the learning abilities of humans and make sure that we have the algorithms and the data for models to learn in the same ways. And so to the extent that LLMs learn in different ways from humans, then yeah, I think something new will be needed.

Lenny RachitskyThis connects to reinforcement learning. This is something that you're big on and something I'm hearing more and more is just becoming a big deal in the world of post-training. Can you just help people understand what is reinforcement learning and reinforcement learning environments, and why they're going to be more and more important in the future?

Edwin ChenReinforcement learning is essentially training your model to reach a certain reward. And let me explain what an RL environment is. An RL environment is essentially a simulation of the real world. Think of it like building a video game with a fully fleshed-out universe. Every character has a real story, every business has tools and data you can call, and you have all these different entities interacting with each other.

So for example, we might build a world where you have a startup with Gmail messages, and Slack threads, and Jira tickets, and GitHub PRs, and a whole code base. And then suddenly AWS goes down. And Slack goes down. And so, "Okay. Model, well, what do you do?" The model needs to figure it out.
So we give the models tasks in these environments, we design interesting challenges for them, and then we run them to see how they perform. And then we teach them: we give them rewards based on whether they're doing a good job or a bad job.
And I think one of the interesting things is that these environments really showcase where models are weak at end-to-end tasks in the real world. You have all these models that seem really smart on isolated benchmarks. They're good at single-step tool calling. They're good at single-step instruction following. But suddenly you dump them into these messy worlds where you have confusing Slack messages and tools they've never seen before, and they need to perform the right actions and interact over longer time horizons, where what they do in step one affects what they do in step 50. That's very different from the academic single-step environments they've been in before, and so the models just fail catastrophically in all these crazy ways.
So I think these RL environments are going to be really interesting playgrounds for the models to learn from. They'll essentially be simulations that mimic the real world, and so models will hopefully get better and better at real tasks compared to all these contrived environments.
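The "AWS goes down" scenario Edwin describes can be sketched as a toy loop. Everything here is hypothetical and illustrative, not Surge's actual setup: the environment class, the tool names, and the reward weights are all made up to show the shape of an agent acting in a simulated world and being scored on outcome plus efficiency.

```python
# A minimal, hypothetical sketch of an RL environment loop. All names
# (StartupEnv, the tools, the incident task) are illustrative only.

class StartupEnv:
    """Simulated company world: tools the agent can call, plus hidden state."""

    def __init__(self):
        self.aws_up = False          # the incident: AWS is down
        self.steps = []              # trajectory of (tool, args) calls

    def call_tool(self, tool, **args):
        self.steps.append((tool, args))
        if tool == "read_slack":
            return "ops: alerts firing, site unreachable"
        if tool == "check_status":
            return {"aws": "down" if not self.aws_up else "up"}
        if tool == "failover_region":
            self.aws_up = True       # the fix: traffic moves to a healthy region
            return "failover complete"
        return "unknown tool"

    def reward(self):
        # Reward the outcome (service restored) and efficiency (few steps used).
        fixed = 1.0 if self.aws_up else 0.0
        efficiency = 1.0 / (1 + len(self.steps))
        return fixed + 0.1 * efficiency


def run_agent(env):
    """A stand-in for the model: reads signals, diagnoses, applies the fix."""
    env.call_tool("read_slack")
    status = env.call_tool("check_status")
    if status["aws"] == "down":
        env.call_tool("failover_region")
    return env.reward()


print(run_agent(StartupEnv()))  # 1.025: outcome reward plus a small efficiency bonus
```

A real environment would have far richer state (inboxes, tickets, a code base) and a learned policy in place of `run_agent`, but the loop of act, observe, and receive a reward is the same.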

Lenny RachitskySo I'm trying to imagine what this looks like. Essentially, it's like a virtual machine with, I don't know, a browser or a spreadsheet or something in it with like, I don't know, surge.com. Is that your website, surge.com? Let's make sure we get that right.

Edwin ChenSo we are actually surgehq.ai.

Lenny RachitskySurgehq.ai. Check it out. You're hiring, I imagine. Yes. Okay. So it's like, cool, here's surgehq.ai. Here's your job as an agent, let's say: make sure it stays up. And then all of a sudden it goes down, and the objective function is: figure out why. Is that an example?

Edwin ChenYeah, so the objective function, or the goal of the task, might be: okay, go figure out why and fix it. And the objective function might be passing a series of unit tests, or it might be writing a document, maybe a retro, containing certain information that matches exactly what happened. There are all these different rewards we might give it that determine whether or not it's succeeding, and so we're basically teaching the models to achieve that reward.
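The two rewards Edwin mentions, passing unit tests and writing a retro that matches what happened, can be combined into one score. This is a hedged sketch: the check functions, the required facts, and the 0.7/0.3 weighting are all invented for illustration, not an actual Surge scoring scheme.

```python
# Hypothetical composite reward for "fix the outage and write a retro".

def unit_test_reward(results):
    """Fraction of unit tests that pass after the model's fix."""
    return sum(results) / len(results)

def retro_reward(retro_text, required_facts):
    """Fraction of required facts the retro document actually mentions."""
    hits = sum(1 for fact in required_facts if fact.lower() in retro_text.lower())
    return hits / len(required_facts)

def total_reward(results, retro_text, required_facts):
    # Weight the code fix more heavily than the writeup (arbitrary choice).
    return 0.7 * unit_test_reward(results) + 0.3 * retro_reward(retro_text, required_facts)

retro = "Root cause: AWS us-east-1 outage. Mitigation: failover to us-west-2."
tests = [True, True, True, False]          # 3 of 4 tests pass
facts = ["us-east-1", "failover"]
print(total_reward(tests, retro, facts))   # 0.7 * 0.75 + 0.3 * 1.0, about 0.825
```

The point is that "did it succeed" is rarely a single boolean: partial credit on tests and on the writeup gives the model a smoother signal to learn from.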

Lenny RachitskySo essentially it's off and running. Here's your goal: figure out why the site went down and fix it. And it just starts trying stuff, using everything, all the intelligence it's got. It makes mistakes, you kind of help it along the way, reward it if it's doing the right sort of thing. And so what you're describing here is the next phase of models becoming smarter: more RL environments focused on very specific tasks that are economically valuable, I imagine.

Edwin ChenYeah, so just in the same way that there were all these different methods for models to learn in the past (originally we had SFT and RLHF, and then we had rubrics and verifiers), this is the next stage. And it's not the case that the previous methods are obsolete; this is, again, just a different form of learning that complements all the previous types. It's just a different set of skills for the model to learn.

Lenny RachitskyAnd so in this case, it's less some physics PhD sitting around talking to a model, correcting it, giving it evals of "here's the correct answer," creating rubrics and things like that. It's more like this person is now designing an environment. Another example I've heard is a financial analyst: "Here's an Excel spreadsheet, here's your goal, figure out our profit and loss," or whatever. And so this expert, instead of just sitting around writing rubrics, is designing this RL environment.

Edwin ChenYeah, exactly. So that financial analyst might create a spreadsheet, and they may create certain tools that the model needs to call in order to help fill out that spreadsheet. It might be: okay, the model needs to access a Bloomberg terminal, and it needs to learn how to use it. And it needs to learn how to use this calculator, and how to perform this calculation. So it has all these tools that it has access to.

And then the reward might be: okay, maybe I will download that spreadsheet and check, does cell B22 contain the correct profit-and-loss number? Or does tab number two contain this piece of information?
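That "check cell B22" style of verifier is easy to picture in code. In this sketch the spreadsheet is modeled as a plain dict of tab to cell to value; a real grader would read the downloaded workbook instead, and the tab names, cells, and numbers here are all made up.

```python
# Illustrative spreadsheet grader: pass only if every expected cell matches.

def grade_spreadsheet(workbook, expectations, tolerance=0.01):
    """Return True only if every expected (tab, cell, value) triple matches."""
    for tab, cell, expected in expectations:
        actual = workbook.get(tab, {}).get(cell)
        if actual is None:
            return False                       # cell missing entirely
        if abs(actual - expected) > tolerance:
            return False                       # wrong number
    return True

model_output = {
    "Summary": {"B22": 1_204_500.00},          # profit & loss
    "Tab2":    {"C3": 0.42},                   # e.g. a margin figure
}
checks = [("Summary", "B22", 1_204_500.00), ("Tab2", "C3", 0.42)]
print(grade_spreadsheet(model_output, checks))  # True
```

Because the check is mechanical, the analyst's expertise goes into choosing which cells and values constitute success, not into grading each run by hand.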

Lenny RachitskyAnd what's interesting, this is a lot closer to how humans learn. We just try stuff, figure out what's working and what's not. You talk about how trajectories are really important to this. It's not just "here's the goal and here's the end"; it's every step along the way. Can you talk about what trajectories are and why they're important?

Edwin ChenI think one of the things people don't realize is that sometimes, even though the model reaches the correct answer, it does so in all these crazy ways. In the intermediate trajectory, it may have tried 50 different times and failed, but eventually it just kind of randomly lands on the correct number. Or maybe...

Sometimes it just does things very inefficiently, or it almost reward-hacks its way to the correct answer, and so I think paying attention to the trajectory is actually really important. And it's also really important because some of these trajectories can be very long. If all you're doing is checking whether or not the model reaches the final answer, there's all this information about how the model behaved in the intermediate steps that's missing.
Sometimes you want models to get to the correct answer by reflecting on what they did. Sometimes you want them to get to the correct answer by just one-shotting it. And if you ignore all of that, you're just missing a lot of what you could be teaching the model to do.
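One way to make the trajectory matter, not just the final answer, is to discount a correct answer by how much of the step budget it burned. This is a toy illustration of the idea, not a real training objective: the step cap, weights, and example trajectories are all invented.

```python
# Hypothetical trajectory-aware reward: two runs reach the same answer,
# but the one that flails through 49 failed guesses scores much lower.

def trajectory_reward(steps, correct, max_steps=50):
    """Reward correctness, discounted by how much of the step budget was used."""
    if not correct:
        return 0.0
    efficiency = 1.0 - min(len(steps), max_steps) / max_steps
    return 0.5 + 0.5 * efficiency   # correct answers score in [0.5, 1.0]

one_shot = ["compute"]                       # clean, direct solution
flailing = ["guess"] * 49 + ["compute"]      # lands on the answer eventually

print(trajectory_reward(one_shot, True))     # 0.5 + 0.5 * (1 - 1/50) = 0.99
print(trajectory_reward(flailing, True))     # 0.5 + 0.5 * (1 - 50/50) = 0.5
```

A richer grader could also inspect what each step was (to catch reward-hacking) or reward deliberate reflection steps; the point is simply that the path carries signal the final answer alone throws away.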

Lenny RachitskyI love that. Yeah, it tries a bunch of stuff and eventually gets it right. You don't want it to learn this is the way to get there. There's often a much more efficient way of doing it.

You mentioned all the steps we've taken along the journey of helping models get smarter. Since you've been so close to this for so long, I think this will be really helpful for people. What have been the steps along the way, from the first post-training techniques, that have most helped models advance? Where do evals fit in with RL environments? What have been the steps, and now we're heading towards RL environments?

Edwin ChenOriginally, the way models started getting post-trained was purely through SFT. And-

Lenny RachitskyWhat does that stand for?

Edwin ChenSo SFT stands for supervised fine-tuning. So again, I think often in terms of these human analogies, and so SFT is a lot like mimicking a master and copying what they do.

And then RLHF became very dominant. And the analogy there would be: sometimes you learn by writing 55 different essays and someone telling you which one they liked the most.
And then I think over the past year or so, rubrics and verifiers have become very important. And rubrics and verifiers are like learning by being graded and getting detailed feedback on where you went wrong.

Lenny RachitskyAnd those are evals, another word for that?

Edwin ChenYeah. So I think evals often covers two things. One is using evaluations for training: you're evaluating whether or not the model did a good job, and when it does do a good job, you're rewarding it.

And then there's this other notion of evals where you're trying to measure the model's progress: okay, I have five different candidate checkpoints and I want to pick the one that's best in order to release it to the public, so I'm going to run all these evals on these five different checkpoints to decide which one is best.
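That second sense of evals, scoring candidate checkpoints and shipping the best one, reduces to a small harness. The checkpoint names, eval dimensions, and scores below are entirely made up; averaging the dimensions equally is one arbitrary choice among many (a real lab would weight them by what it cares about).

```python
# Toy checkpoint-selection eval: average each candidate's scores, pick the top.

def pick_best_checkpoint(scores_by_checkpoint):
    """Average each checkpoint's eval scores and return the top performer."""
    def mean_score(item):
        _, scores = item
        return sum(scores.values()) / len(scores)
    best, _ = max(scores_by_checkpoint.items(), key=mean_score)
    return best

candidates = {
    "ckpt-a": {"accuracy": 0.82, "instruction_following": 0.75, "coding": 0.70},
    "ckpt-b": {"accuracy": 0.80, "instruction_following": 0.84, "coding": 0.78},
    "ckpt-c": {"accuracy": 0.85, "instruction_following": 0.73, "coding": 0.72},
}
print(pick_best_checkpoint(candidates))  # ckpt-b has the highest average
```

Note how the weighting itself encodes the taste Edwin described earlier: change the weights and a different checkpoint "wins," which is exactly why who designs the eval matters.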

Lenny RachitskyAwesome.

Edwin ChenYeah, and now we have RL environments, so this is kind of like a hot new thing.

Lenny RachitskyAwesome. What I love about this business journey is there's always something new. It's always like, okay, we're getting so good at creating all this beautiful data for companies, and now they need something completely different. Now we're setting up all these virtual machines for them and all these different use cases.

Edwin ChenYep.

Lenny RachitskyAnd it feels like that's a big part of this industry you're in, it's just adapting to what labs are asking for.

Edwin ChenYeah. So I really do think that we are going to need to build a suite of products that reflect a million different ways that humans learn.

Like for example, think about becoming a great writer. You don't become great by memorizing a bunch of grammar rules. You become great by reading great books, and you practice writing, and you get feedback from your teachers and from the people who buy your books in a bookstore and leave reviews. And you notice what works and what doesn't. And you develop taste by being exposed to all of these masterpieces and also just terrible writing. So you learn through this endless cycle of practice and reflection. Each of these is a very different method of learning to become a great writer, so just in the same way that there are a thousand different ways a great writer becomes great, I think there are going to be a thousand different ways that AIs need to learn.

Lenny RachitskyIt's so interesting that this ends up being just like humans in so many ways. It makes sense, because in a sense, neural networks and deep learning are modeled after how humans learn and how our brains operate. But it's interesting that making them smarter comes down to: how do we get closer and closer to how humans learn?

Edwin ChenYeah, it's almost like maybe the end goal is just throwing you into the environment and just seeing how you evolve. But within that evolution, there's all these different sub-learning mechanisms.

Lenny RachitskyYeah, which is kind of what we're doing now, so that's really interesting. This might be the last step until we hit AGI. Along these lines, something really unique about Surge that I learned is you guys have your own research team, which I think is pretty rare. Talk about why that's something you've invested in and what has come out of that investment.

Edwin ChenYeah, so I think that stems from my own background. My own background is as a researcher. And so I've always cared fundamentally about pushing the industry and pushing the research community and not just about revenue. And so I think what our research team does is a couple different things.

So we almost have two types of researchers at our company. One is our forward-deployed researchers, who are often working hand in hand with our customers to help them understand their models. We'll work very closely with the customers to help them understand: "Okay, this is where your model is today. This is where you're lagging behind all the competitors. These are some ways you could be improving in the future, given your goals, and we're going to design these data sets, these evaluation methods, these training techniques to make your models better." It's this very collaborative notion of working with our customers, who are researchers themselves, just a little bit more focused on the data side, and working hand in hand with them to do whatever it takes to make them the best.
And then we also have our internal researchers. So our internal researchers are focused on slightly different things. So they are focused on building better benchmarks and better leaderboards.
So I've talked a lot about how I worry that the leaderboards and benchmarks out there today are steering models in the wrong direction, so yeah, the question is, how do we fix that? And that's what our research team is focused really heavily on right now. So they're working a lot on that.
And they're also working on these other things like, "Okay, we need to train our own models to see what types of data performs the best, what types of people perform the best." And so they're also working on all these training techniques and evaluation of our own data sets to improve our data operations and the internal data products that we have that determine what makes something good quality.

Lenny RachitskyIt's such a cool thing, because outside of the labs themselves, I don't think companies have researchers helping them advance AI. I imagine it's pretty rare for a company like yours to have researchers actually doing primary research on AI.

Edwin ChenYeah, I think it's just because it's something I've fundamentally always cared about. I often think about us more like a research lab than a startup, because that is my goal. It's kind of funny, but I've always said I would rather be Terence Tao than Warren Buffett. So that notion of creating research that pushes the frontier forward, and not just getting some valuation, that's always been what drives me.

Lenny RachitskyAnd it's worked out. That's the beautiful thing about this. You mentioned that you were hiring researchers. Is there anything you want to share about the folks you're looking for?

Edwin ChenSo we look for people who are just fundamentally interested in data all day. The types of people who could literally spend 10 hours digging through a dataset, and playing around with models, and thinking, "Okay, yeah, this is where I think the model's failing, and this is the kind of behavior you want the model to have instead." Just this aspect of being very hands-on and thinking about the qualitative aspects of models and not just the quantitative parts. So again, it's this aspect of being hands-on with data and not just caring about these kinds of abstract algorithms.

Lenny RachitskyAwesome.

I want to ask a couple broad AI kind of market questions. What else do you think is coming in the next couple of years that people are maybe not thinking enough about or not expecting in terms of where AI is heading? What's going to matter?

Edwin ChenI think one of the things that's going to happen in the next few years is that the models are actually going to become increasingly differentiated because of the personalities and behaviors that the different labs have and the kind of objective functions that they are optimizing their models for. I think it's one thing I didn't appreciate a year or so ago.

A year or so ago, I thought that all of the AI models would essentially become very commoditized. They would all behave like each other, and sure, one of them might be slightly more intelligent in one way today, but sure, the other ones would catch up in the next few months. But I think over the past year, I've realized that the values that the companies have will shape the model.
So let me give you an example. I was asking Claude to help me draft an email the other day, and it went through 30 different versions. And after 30 minutes, yeah, I think it really crafted me the perfect email, and I sent it. But then I realized that I spent 30 minutes doing something that didn't matter at all. Sure, now I got the perfect email, but I spent 30 minutes on something I wouldn't have worried about at all before, and this email probably didn't even move the needle on anything anyways.
So I think there's a deep question here, which is, if you could choose the perfect model behavior, which model would you want? Do you want a model that says, "You're absolutely right. There are definitely 20 more ways to improve this email," and it continues for 50 more iterations. And it sucks up all your time and engagement. Or do you want a model that's optimizing for your time and productivity and just says, "No, you need to stop. Your email's great. Just send it and move on with your day"?
And again... In the same way that there's kind of a fork in the road in how you choose your model's behavior for this question, for every other question models face, the kind of behavior that you want will fundamentally shape it.
It's almost like, in the same way that when Google builds a search engine, it's very different from how Facebook would build a search engine, which is very different from how Apple would build a search engine. They all have their own principles and values and things that they're trying to achieve in the world that shape all the products that they're going to build. And in the same way, I think all the models will start behaving very differently too.

Lenny RachitskyThat is incredibly interesting. You already see that with Grok. It's got a very different personality and a very different approach to answering questions. And so what I'm hearing is you're going to see more of this differentiation.

Edwin ChenYep.

Lenny RachitskyKind of another question along these lines, what do you think is most under-hyped in AI that you think maybe people aren't talking enough about that is really cool? And what do you think is over-hyped?

Edwin ChenSo I think one of the things that's under-hyped is the built-in products that all of the chatbots are going to start having. I've always been a huge fan of Claude's artifacts. I think it just works really well. And actually, the other day, I don't know if it's a new feature or not, but I asked it to help me create an email, and then it just created... It didn't quite work, because it couldn't send the email. But what it created instead was a little, I don't know what you'd call it, a little box where I could click on it and it would just text someone with that message. And I think that concept of taking artifacts to the next level, where you just have these mini apps, mini UIs within the chatbots themselves, I feel like people aren't talking enough about that. So I think that that's one under-hyped area.

And in terms of over-hyped areas, I definitely think that vibe coding is over-hyped. I think people don't realize how much it's going to make their systems unmaintainable in the long term, and they simply dump this code into their codebases because it seems to work right now. So I kind of worry about the future of coding if this just keeps on happening.

Lenny RachitskyThese are amazing answers. On that first point, there's something I actually asked. I had the chief product officers of OpenAI and Anthropic, Kevin Weil and Mike Krieger, on the podcast, and I asked them, "As a product team, you have this gigabrain intelligence. How long do you even need product teams?" You'd think this AI will just create the product for you: "Here's what I want." It's like the next level of vibe coding. You just tell it, "Here's what I want," and it's building the product and evolving the product as you're using it. And it feels like that's what you're describing, where we might be heading.

Edwin ChenYeah, I think there's a very powerful notion where it helps people just achieve their ideas in a much cooler way.

Lenny RachitskySomething we haven't gotten into that I think is really interesting is just the story of how you got to starting Surge. You have a really unique background. I always think about these... Brian Armstrong, the founder of Coinbase, once gave this talk that has really stuck with me, where he talked about how his very unique background allowed him to start Coinbase. He had an economics background, he had cryptography experience, and he was an engineer. It's like the perfect Venn diagram for starting Coinbase, and I feel like you have a very similar story with Surge. Talk about your background there and how that led to Surge.

Edwin ChenGoing way back, I was always fascinated by math and language when I was a kid. I went to MIT because it's obviously one of the best places for math and CS, but also because it's the home of Noam Chomsky. My dream in school was actually to find some underlying theory connecting all these different fields.

And then I became a researcher at Google, and Facebook, and Twitter, and I just kept running into the same problem over and over again. It was impossible to get the data that we needed to train our models. So I was always this huge believer in the need for high quality data, and then GPT-3 came out in 2020. And I realized that, yeah, if we wanted to take things to the next level and build models that could code, and use tools, and tell jokes, and write poetry, and solve , and cure cancer, then yeah, we were going to need a completely new solution.
The thing that always drove me crazy when I was at all these companies was that we had the full power of the human mind in front of us, and all the data companies out there were focused on really simple things like image labeling. So I wanted to build something focused on all these advanced, complex use cases instead that would really help us build our next generation of models. So yeah, I think my background across math, and computer science, and linguistics really informed what I always wanted to do, and so I started Surge a month later with the one mission of basically building the use cases that I thought were going to be needed to push the frontier of AI.

Lenny RachitskyAnd you said a month later, a month later after what?

Edwin ChenAfter the GPT-3 launch in 2020.

Lenny RachitskyOh, okay. Wow. Okay. Yeah. A great decision.

What just kind of drives you at this point of... Other than just the epic success you're having, what keeps you motivated to keep building this and building something in this space?

Edwin ChenI think I'm a scientist at heart. I always thought I was going to become this math or CS professor and work on trying to understand the universe, and language, and the nature of communication. It's kind of funny, but I always had this fanciful dream where if aliens ever came to visit Earth and we need to figure out how to communicate with them, I wanted to be the one the government would call. And I'd use all this fancy math, and computer science, and linguistics to decipher it.

So even today, what I love doing most is, every time a new model is released, we'll actually do a really deep dive into the model itself. I'll play around with it, I'll run evals, I'll compare where it's improved and where it's regressed, and I'll create this really deep-dive analysis that we send our customers. And it's actually kind of funny, because a lot of times we'll say it's from our data science team, but often it's actually just from me.
And I think I could do this all day. I have a very hard time being in meetings all day. I'm terrible at sales, I'm terrible at doing the typical CEO things that people expect you to do, but I love writing these analyses. I love jamming with our research team about what we're seeing; sometimes I'll be up until 3:00 AM just talking on the phone with somebody on the research team about a model. So I love that I still get to be really hands-on, working on the data and the science all day. And I think what drives me is that I want Surge to play this critical role in the future of AI, which I think is also the future of humanity. We have these really unique perspectives on data, and language, and quality, and how to measure all of this, and how to ensure it's all going on the right path. And I think we're uniquely unconstrained by all of these influences that can sometimes steer companies in a negative direction.
Like what I was saying earlier, we built Surge a lot more like a research lab than a typical startup. So we care about curiosity and long-term incentives and intellectual rigor, and we don't care as much about quarterly metrics and what's going to look good in a . And so my goal is to take all these unique things about us as a company and use that to make sure that we're shaping AI in a way that's really beneficial for our species in the long term.

Lenny RachitskyWhat I'm realizing in this conversation is just how much influence you, and companies like yours, have on where AI heads. The fact that you help labs understand where they have gaps and where they need to improve... Everyone looks at the heads of OpenAI and Anthropic and all these companies as the ones ushering in AI, but what I'm hearing here is you have a lot of influence on where things head too.

Edwin ChenYeah, I think there's this really powerful ecosystem where, honestly, people just don't know where models are headed, how they want to shape them yet, and how they want humanity to play a role in the future of all of this, and so I think there's a lot of opportunity to just continue shaping the discussion.

Lenny RachitskyAlong that thread, I know you have a very strong thesis on just why this work matters to humanity and why this is so important, talk about that.

Edwin ChenI'll get a bit philosophical here, but I think the question itself is a bit philosophical, so bear with me. So the most straightforward way of thinking about what we do is we train and evaluate AI. But there's a deeper mission that I often think about, which is helping our customers think about their dream objective functions. Like, yeah, what kind of model do they want their model to be? And once we help them do that, we'll help them train their model to reach their north star, and we'll help them measure that progress. But it's really hard, because objective functions are really rich and complex. It's kind of like the difference between having a kid and asking, "Okay, what test do you want them to pass? Do you want them to get a high score on the SAT and write a really good college essay?" That's the simplistic version, versus: what kind of person do you want them to grow up to be? Will you be happy if they're happy no matter what they do, or are you hoping they'll go to a good school and be financially successful?

And again, if you take that notion, it's like, okay, how do you define happiness? How do you measure whether they're happy? How do you measure whether they're financially successful? It's a lot harder than simply measuring whether or not you're getting a high score on the SAT. And what we're doing is we want to help our customers reach, again, their dream north stars and figure out how to measure them. And so I talked about this example of what you want models to do when you're asking them to write an email: do you just continue for 50 more iterations, or do you just say, "No, just move on, because this is perfect enough"? And the broader question is, are we building systems that actually advance humanity? And if so, how do we build the datasets to train towards that and measure it? Or are we optimizing for all of the wrong things, just systems that suck up more and more of our time and make us lazier and lazier?
And yeah, I think it's really relevant to what we do, because it's very hard to measure and define whether something is genuinely advancing humanity. It's very easy to measure all these proxies instead, like clicks and likes. But I think that's why our work is so interesting. We want to work on the hard, important metrics that require the hardest types of data, and not just the easy ones. So one of the things I often say is: you are your objective function. So we want the rich, complex objective functions and not these simplistic proxies. And our job is to figure out how to get the data to match this.
So yeah, we want data, we want metrics that measure whether AI is making your life richer. We want to train our systems this way. And we want tools that make us more curious and more creative, not just lazier. And it's hard because, yeah, humans are kind of inherently lazy, so AI slop feeds are the easiest way to get engagement and make all your metrics go up. So I think this question about choosing the right objective functions, and making sure that we're optimizing towards them and not just these easy proxies, is really important to our future.

Lenny RachitskyWow. I love how what you're sharing here gives so much more appreciation of the nuances of building AI, training AI, the work that you're doing.

From the outside, people could just look at Surge and companies in the space of, okay, cool. They're just creating all this data, feeding it to AI. But clearly there's so much to this that people don't realize, and I love knowing that you're at the head of this, that someone like you is thinking through this so deeply.
Maybe one more question, is there something you wish you'd known before you started Surge? A lot of people start companies, they don't know what they're getting into. Is there something you wish you could tell your earlier self?

Edwin ChenYeah, so I definitely wish I'd known that you could build a company by being heads down and doing great research and simply building something amazing, and not by constantly tweeting and hyping and fundraising. It's kind of funny, but I never thought I wanted to start a company. I love doing research. And I was actually always a huge fan of DeepMind, because they were this amazing research company that got bought and still managed to keep on doing amazing science. But I always thought that they were this magical unicorn. So I thought if I started a company, I'd have to become a business person, looking at financials all day and being in meetings all day and doing all this stuff that sounded incredibly boring and that I always hated. So I think it's crazy that it didn't end up being true at all. I'm still in the weeds in the data every day. And I love it. I love that I get to do all these analyses and talk to researchers. And it's basically applied research, where we're building all these amazing data systems that have really pushed the frontier of AI.

So yeah, I wish I'd known that you don't need to spend all your time fundraising. You don't need to constantly generate hype. You don't need to become someone you're not. You can actually build a successful company by simply building something so good that it cuts through all that noise. And I think if I'd known this was possible, I would've started even sooner.

Lenny RachitskyAnd that is such an amazing place to end. I feel like this is exactly what founders need to hear, and I think this conversation's going to inspire a lot of founders, and especially a lot of founders that want to do things in a different way. Before we get to a very exciting lightning round, is there anything else you wanted to share? Anything else you want to leave our listeners with? We covered a lot of ground, it's totally okay to say no as well.

Edwin ChenI think the thing I would end with is that a lot of people think of data labeling as simplistic work, like labeling cat photos and drawing bounding boxes around cars. And so I've actually always hated the term data labeling, because it paints this very simplistic picture, when I think what we're doing is completely different. I think about what we're doing as a lot more like raising a child. You don't just feed a child information. You're teaching them values, and creativity, and what's beautiful, and these infinite subtle things about what makes somebody a good person. And that's what we're doing for AI. So yeah, I just often think about what we're doing as almost like the future of humanity, or how we're raising humanity's children, so I'll leave it at that.

Lenny RachitskyWow. I love just how much philosophy there is in this whole conversation that I was not expecting.

With that, Edwin, we've reached our very exciting lightning round, I've got five questions for you. Are you ready?

Edwin ChenYep, let's go.

Lenny RachitskyHere we go. What are two or three books that you find yourself recommending most to other people?

Edwin ChenYes, so three books I often recommend are, first, Story of Your Life by Ted Chiang. It's my all-time favorite short story. It's about a linguist learning an alien language, and I basically reread it every couple of years.

Lenny RachitskyAnd that's what the Interstellar was about? Is that...

Edwin ChenYeah, so there's a movie called Arrival...

Lenny RachitskyArrival.

Edwin Chen... which was based off of the story,

Lenny RachitskyYes.

Edwin Chen... which I love as well.

Lenny RachitskyGreat. Okay, keep going.

Edwin ChenAnd then second, The Myth of Sisyphus by Camus. I actually can't really explain why I love it, but I always find the final chapter somehow really inspiring.

And then third, Le Ton beau de Marot by Douglas Hofstadter. I think Gödel, Escher, Bach is his more famous book, but I've actually always loved this one better. It basically takes a single French poem and translates it 89 different ways and discusses all the motivations behind each translation. And so I've always loved the way it embodies this idea that translation isn't this robotic thing that you do. Instead, there's a million different ways to think about what makes a high-quality translation, which matches a lot of the ways I think about data and quality in LLMs.

Lenny RachitskyAll of these resonate so deeply with all the things we've been talking about, especially that first one. Given your goal after school was, "I want to help translate alien languages," I'm not surprised you love that short story.

Next question, do you have a favorite recent movie or TV show you've really enjoyed?

Edwin ChenOne of my new all-time favorite TV shows is something I found recently. It's called Travelers. It's basically about a group of travelers from the future who are sent back in time to prevent the collapse of their society. Sorry, I just spoiled that for you.

And then I actually just rewatched Contact, which is one of my all time favorite movies. So yeah, I think one of the things you'll notice about me is that, yeah, I love any kind of book or film that involves scientists deciphering alien communication. Again, just this dream I always had as a kid.

Lenny RachitskyThat's so funny.

Okay, is there a product you've recently discovered that you really love?

Edwin ChenSo it's funny, but I was in SF earlier this week and I finally took Waymo for the first time. Honestly, it was magical and it really felt like living in the future.

Lenny RachitskyYeah, it's like the thing that... People hype it like crazy, but it always exceeds your expectations.

Edwin ChenYeah, it deserves the hype. It was crazy. Yeah, it's absurd. It's like, holy moly. If you're not in SF, you don't realize just how common these things are. They're just all over the place, driverless cars constantly going about, and when an event ends, there are just all these Waymos lined up picking people up.

Lenny RachitskyYeah. Waymo, good job. Good job over there.

Do you have a favorite life motto that you find yourself coming back to in work or in life?

Edwin ChenSo I think I mentioned this idea that founders should build a company that only they could build. Almost like it's this destiny that their entire life, and experiences, and interests shaped them towards. And so I think that principle applies pretty broadly, not just to founders, but to anyone creating things, I think.

Lenny RachitskyWell, let me follow that thread and un-lightning-round this answer. Do you have any advice for how to build those sorts of experiences that help lead to that? Is it to follow things that are interesting to you? Because it's easy to say that; it's hard to actually acquire these really unique sets of experiences that allow you to create something really important.

Edwin ChenYeah, so I think it would always be to really follow your interests and do what you love, and it's almost like a lot of decisions I make about Surge. I think one of the things that I didn't think about a couple years ago, but then someone said it to me, it's that companies in a sense are an embodiment of their CEO. And it's kind of funny. I hadn't thought about that because I never quite knew what a CEO did. I always thought a CEO was kind of generic and it's like, okay, you're just doing whatever VPs, and your board, and whatever, tell you to do and you're just saying yes to decisions. But instead, it's this idea where when I think about certain big, hard decisions we have to make, I don't think what would the company do, I don't think what metrics are we trying to optimize, I just think, "What do I personally care about? What are my values? And what do I want to see happen in the world?"

And so I think following that idea about... Okay, so ask yourself: what are the values you care about? What are the things you're trying to shape? And not just: what will look good on a dashboard? I think that's really important.

Lenny RachitskyI love how just you're just full of endless, beautiful, and very deep answers.

Final question. Something that you got quite famous for before starting Surge is you built this map while you were at Twitter, a map showing what people call soft drinks, whether they called it soda or pop. I don't know if it's called Soda Pop. What was the name of this map?

Edwin ChenYeah, it was like the Soda Versus Pop dataset.

Lenny RachitskySoda Versus Pop.
And so it's like a map of the United States and it tells you where people say pop versus soda, so do you say soda or pop?

Edwin ChenSo I say soda, I'm a soda person.

Lenny RachitskyOkay. And is that just like that's the right answer or it's like whatever you are, it's totally fine.

Edwin ChenI think I'll look at you a little bit funny. If you say pop, I'll wonder where you came from, but I won't judge you too much.

Lenny RachitskyThat's how I feel too.

Edwin, this was incredible. This was such an awesome conversation. I learned so much. I think we're going to help a lot of people start their own companies, help their companies become more aligned with their values and just building better things.
Few final questions, where can folks find you online if they want to reach out? What roles are you hiring for? How can listeners be useful to you?

Edwin ChenYeah, so I used to love writing a blog, but I haven't had time in the past few years. But I am starting to write again, so definitely check out the Surge blog, surgehq.ai/blog, and yeah, hopefully I'll be writing a lot more there. And I would say we're definitely always hiring, so for people who just love data, and people who love this intersection of math, and language, and computer science, definitely reach out anytime.

Lenny RachitskyAwesome. And how can listeners be useful to you? Is it just, I don't know, yeah, is there anything there? Any asks?

Edwin ChenSo I would say definitely tell me blog topics that you'd like me to write about...

Lenny RachitskyOkay.

Edwin Chen... and then I'm always fascinated by all of these AI failures that happen in the real world. So whenever you come across a really interesting failure that illustrates some deep question about how we want models to behave... There are just so many different ways a model can respond, and I oftentimes think there's just not a single right answer. And so whenever there's one of these examples, I just love seeing them.

Lenny RachitskyYou need to share these on your blog. I'm also... I would love to see these.

Edwin, thank you so much for being here.

Edwin ChenThank you.

Lenny RachitskyBye everyone.

Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.

Chapter 02 / 10

Section 02

Lenny RachitskyYou hit $1 billion in revenue in less than four years with a team of only around 60 to 70 people, completely bootstrapped, without taking a penny of VC money. I really don't think anyone has ever done this before.

Edwin ChenWe never wanted to play by the Silicon Valley playbook from the start. I always thought it was pretty ridiculous. When I worked at several big tech companies, I always felt that if you cut 90% of the people, everyone would actually move faster, because the truly excellent people wouldn't be dragged down by all the distractions. So when we founded Surge, we wanted to build it completely differently, with a super small, super strong team.

Lenny RachitskyYou are by far the most successful data company out there.

Edwin ChenWe essentially teach AI models what's good and what's bad. Many people fundamentally don't understand what "quality" even means in this space. They think you can throw bodies at a problem and get good data, which is completely wrong.

Lenny RachitskyFrom a regular person's perspective, these models don't seem to be getting dramatically smarter all the time.

Edwin ChenOver this past year, I've increasingly realized that a company's values directly shape its model. The other day I asked Claude to help me draft an email. In the end it took 30 minutes of my time, and it really did polish the email to perfection, and I sent it. But looking back, I'd just wasted 30 minutes on something that didn't matter at all. If you had to pick the ideal model behavior, which would you choose? Do you want a model that says, "You're absolutely right, there are definitely 20 more ways to improve this email," and then keeps iterating for 50 more rounds? Or do you want a model that values your time and productivity and just tells you, "Stop. This email is already great. Send it and move on to the next thing"?

Lenny RachitskyThat's a pretty bold take: that a lot of the labs are actually pushing AGI in the wrong direction.

Edwin ChenWhat worries me is that we should be building the kind of AI that genuinely advances humanity, curing cancer, solving poverty, understanding the universe, but instead we're optimizing for the AI version of junk content. We're essentially feeding models the stuff that tabloid readers at the supermarket checkout would enjoy. In other words, we're training models to chase dopamine instead of truth.

Lenny RachitskyToday's guest is Edwin Chen, founder and CEO of Surge AI. Edwin is an exceptional CEO, and Surge is an exceptional company. They're the leading AI data company, powering training for all the frontier model labs. They're also one of the fastest companies in history to reach $1 billion in revenue: founded just four years ago, with a team of fewer than 100 people, completely bootstrapped, never having taken a dollar of VC money, and profitable from day one.

As you're about to hear, Edwin has a very different way of thinking about how to build an important company and how to build AI that's genuinely valuable for humanity. I especially loved this conversation and learned a lot. I can't wait for you to hear it. If you enjoy this podcast, remember to subscribe and follow on your favorite podcast app or on YouTube; it really helps.
If you subscribe to my annual newsletter, you also get a full year of a bunch of great products for free, including Devin, Lovable, Replit, Bolt, N8N, Linear, Superhuman, Descript, Wispr Flow, Gamma, Perplexity, Warp, Granola, Magic Patterns, Raycast, ChatPRD, Mobbin, PostHog, and Stripe Atlas. Head to lennysnewsletter.com and click Product Pass. With that, after a short word from our sponsors, I bring you Edwin Chen.
This episode is brought to you by Vanta. Vanta helps companies of all sizes get through compliance reviews quickly and stay compliant continuously. Whether you're tackling your first SOC 2 or ISO 27001, or managing vendor risk, Vanta's trust management platform makes the process faster, simpler, and easier to scale. It also lets you complete security questionnaires up to five times faster, helping you close big deals sooner.
According to a recent IDC study, Vanta customers save over $500,000 a year and are three times more productive. Establishing trust isn't optional; Vanta makes it automatic. Visit vanta.com/lenny for $1,000 off.
One more question for you: what do OpenAI, Cursor, Perplexity, Vercel, Plad, and hundreds of other successful companies have in common? The answer: they all use today's sponsor, WorkOS. Anyone who builds enterprise software has probably felt the pain of integrating single sign-on, SCIM, RBAC, audit logs, and the other features big customers require. WorkOS turns these deal blockers into APIs you can plug straight in, on a modern developer platform built specifically for B2B SaaS.
Whether you're a seed-stage startup chasing your first enterprise customer or a unicorn expanding globally, WorkOS gets you enterprise-ready fastest so you can keep compounding growth. It's essentially the Stripe of enterprise features.
Get started at workos.com; you can also reach them directly through their Slack support, where real engineers are online answering questions quickly. WorkOS lets you build what the best products have, with beautiful APIs, complete docs, and a smooth developer experience. Go to workos.com now and make your app enterprise-ready.
Edwin, thank you so much for joining us. Welcome to the show.

Edwin ChenThanks for having me. I'm really excited.

Lenny RachitskyI want to start with just how absurd what you've accomplished is. A lot of people and companies talk about using AI to run a massive business with a tiny team, and you've actually done it in a way no one has before. You hit $1 billion in revenue in under four years with only 60-some to 70 people, completely bootstrapped, never taking VC money, which I believe is unprecedented. You've essentially turned the dream many people describe about the AI era into reality. My question is: do you think this kind of thing will become more and more common because of AI? And where has AI given you the most leverage in pulling this off?

Edwin ChenRight. We passed $1 billion in revenue last year with a team of fewer than 100 people. I think in the next few years we'll see even more extreme ratios, like $10 billion in revenue per person. AI is only going to get stronger and make everything more efficient, so ratios like that will become the norm sooner or later.

I'd spent time at several big tech companies and always felt that if you cut 90% of the people, we would actually move faster, because the best people wouldn't be distracted by all that noise. So when we founded Surge, we wanted to do it completely differently, with a super small, super elite team. And the craziest part is that it actually worked. So I think there are two forces at play here.
On one hand, people are increasingly realizing that you don't have to build a massive organization to win.
On the other hand, the efficiency gains from AI are happening too. Together, they're going to usher in a really exciting era of company building.
What excites me even more is that the type of company itself will change. It's not just that companies will get smaller; we'll see entirely different kinds of companies emerge. Think about it: fewer employees means less capital needed. Less capital means you don't necessarily need to fundraise. So the founders who emerge won't just be the ones good at storytelling and pitching; they'll be the ones who are better at technology and product.
Then products will no longer serve just revenue and the metrics VCs want to see; they'll increasingly be built by these small, obsessed teams. People will build the things they genuinely care about, real technology and real innovation. So I actually hope the software world eventually gets back to that "changelogs written for hackers" feeling.

Lenny RachitskyYou've done a lot of things that run counter to Silicon Valley instincts, like not maintaining a presence on LinkedIn, not writing viral posts, not constantly promoting Surge on Twitter. I think most people only heard of Surge recently, and then you suddenly appeared and everyone realized, "Oh, this is that fastest-growing company that's already at $1 billion in revenue." Why did you operate this way? I assume it was deliberate.

Edwin ChenWe never wanted to play the Silicon Valley game in the first place. I always thought it was ridiculous. What did you actually dream of doing as a kid? Building a company from scratch and spending every day deep in the code and the product? Or explaining every decision to VCs and getting stuck on a giant PR and fundraising treadmill? It did make things harder for us, because once you raise money, you naturally get pulled into that Silicon Valley industrial machine: VCs tweet about you, TechCrunch writes about you, the papers cover you because you raised at a high valuation. For us, it made things harder, because our only path was to build a product ten times better than everyone else's and let word spread through researchers. But I think it also means our customers really are the people who truly understand data and truly care about it.

I've always believed that early customers have to be deeply aligned with what we're doing. They have to genuinely care about high-quality data and genuinely understand how that data will make their AI models better, because they're the ones helping us. They give us feedback on what we're actually producing. So that very tight mission alignment with customers actually helped us enormously in the early days. In other words, the people buying from us already knew this was different from everyone else and knew it was actually helping them, not because they saw some trend. So it made things harder, but harder in a good way.

Lenny RachitskyIt's genuinely inspiring to hear this kind of founding path: a founder doesn't have to promote themselves on Twitter all day or go fundraise; they can just keep their head down and do the work. I love the Surge story. For people who aren't familiar, give us a quick overview of what Surge actually does.

Edwin Chen我们本质上是在教 AI 模型什么是好、什么是坏。我们用人类数据来训练它们,我们有很多不同的产品,比如 SFT、RLHF、rubrics、verifiers、RL 环境等等,同时我们也会衡量它们到底进步了多少。所以本质上,我们是一家数据公司。

Lenny Rachitsky你经常说,你们成功的核心原因之一就是质量,也就是数据质量。要做出更高质量的数据,到底需要什么?你们和别人到底哪里不一样?大家又忽略了什么?

Edwin Chen我觉得大多数人根本不理解这个领域里“质量”是什么意思。他们以为把人堆进去就能拿到好数据,这完全错了。我给你举个例子。

比如你想训练一个模型写一首关于月亮的八行诗。什么才算一首好诗?如果你对质量没有更深的理解,你可能会说:“这是不是一首诗?是不是八行?有没有出现 moon 这个词?”如果这些条件都满足,那你就会说,行,这题算完成了。但这跟我们想要的完全不是一回事。我们要的是能拿诺奖级别的诗。它是不是独特?是不是有细腻的意象?它会不会让你惊讶、打到心里?它能不能让你对月光的本质多一点理解?它会不会带来情绪的流动?它会不会让你开始思考?这才是我们在想“高质量诗歌”时会去看的东西。
它可能是一首写水面月光的俳句,也可能用了内韵和格律。围绕月亮,你可以有一千种写法,而每一种都会给你关于语言、意象和人类表达的不同洞见。我觉得,用这种方式去理解质量非常难,难以衡量,又非常主观、复杂、丰富,而且标准会高得吓人。所以我们必须搭建很多技术去衡量它,比如对每个工人、每个项目、每个任务都采集成千上万条信号。最后我们要知道,你是擅长写诗、还是擅长写文章、还是擅长写技术文档。更进一步,我们还要收集你的背景、你的专业领域,以及你在这些任务里的真实表现,然后用这些信号判断你是不是适合这些项目、以及你有没有让模型变得更好。
这真的很难,所以要建很多技术去测量。但我觉得,这恰恰就是我们希望 AI 去做的事。所以我们一直在追求对“质量”非常深的理解。

Lenny Rachitsky我听下来是,你们是在自己销售数据的那些垂直领域里,去更深入地理解什么叫质量。那这是不是意味着你要找的人,得非常擅长诗歌,同时也很会做 eval,然后帮你们写这些评估,判断什么才算好?具体机制是怎样的?

Edwin Chen我们的做法是,会收集你在平台上工作时产生的成千上万条信号。我们会看你的键盘敲击、你回答问题的速度、review、代码规范等等。我们甚至会基于你产出的结果自己训练模型,然后看这些结果是否真的提升了模型性能。

这有点像 Google Search 在判断什么是好网页。它其实有两个部分。第一步是把最糟糕的网页全清掉,把所有垃圾、低质量内容、打不开的页面都去掉,这很像内容审核问题,就是把最差的那批先清掉。
但第二步,是要找出最好的那批。也就是:这是不是最好的网页,或者最适合这份工作的人?他们不是那种只会写出“高中水平诗歌”的人。再说一次,他们不是只是在机械地写符合所有条款的诗,而是在写那种能让你产生情绪波动的诗。我们也有同样的一套信号。和筛掉最差那批完全不同,我们是在找最好的那批。所以我们会有这些信号……
就像 Google Search 把这些信号喂进自己的 ML 算法,去预测和判断很多东西一样,我们对所有员工、所有任务、所有项目做的也是同样的事。归根到底,这就像一个复杂的机器学习问题,而这就是它的运作方式。
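Edwin 描述的“对每个标注者采集成千上万条信号、先滤掉最差的一批、再挑出最好的一批”的流程,可以用下面这个极简草图来理解。其中的信号名称、权重和阈值全部是假设的示例,并非 Surge 的真实实现:

```python
# 极简草图:把每个标注者的多维信号聚合成一个质量分,
# 再分两步筛选:先过滤掉最差的一批,再从剩下的人里挑出最好的。
# 所有信号名称和权重均为假设示例。

def quality_score(signals: dict) -> float:
    """按(假设的)权重对信号加权求和,返回 0~1 之间的质量分。"""
    weights = {
        "review_score": 0.5,      # 同行评审得分(0~1)
        "task_accuracy": 0.3,     # 黄金标准任务上的准确率(0~1)
        "consistency": 0.2,       # 与高信誉标注者的一致率(0~1)
    }
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

def select_annotators(candidates: dict, floor: float = 0.4, top_k: int = 2):
    """第一步剔除低于 floor 的人,第二步按分数取前 top_k。"""
    passed = {name: quality_score(s) for name, s in candidates.items()
              if quality_score(s) >= floor}
    return sorted(passed, key=passed.get, reverse=True)[:top_k]

candidates = {
    "alice": {"review_score": 0.9, "task_accuracy": 0.95, "consistency": 0.9},
    "bob":   {"review_score": 0.6, "task_accuracy": 0.7,  "consistency": 0.5},
    "carol": {"review_score": 0.2, "task_accuracy": 0.3,  "consistency": 0.1},
}
print(select_annotators(candidates))  # ['alice', 'bob']:carol 被过滤
```

真实系统里,这一步更像 Edwin 说的那样是一个完整的机器学习问题(用模型去预测信号与最终数据质量的关系),而不是固定权重的加权和。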

Lenny Rachitsky太有意思了。

我想问你一个这几年一直让我很好奇的问题。如果你看 Claude,它在编程和写作上很长时间都比其他模型强太多了。更令人惊讶的是,其他公司居然花了这么久才追上。考虑到这里面的经济价值有多大,几乎所有 AI 编程产品都建立在 Claude 之上,因为它在 Claude Code 和写作上都太强了。是什么让它强这么多?只是训练数据更好吗,还是还有别的原因?

Edwin Chen我觉得原因有好几个。数据当然是很重要的一部分。不过很多人没意识到,前沿实验室在决定喂给模型什么数据时,其实面临着近乎无限多的选择。比如,你是不是只用人类数据?人类数据要怎么采集?当你收集人类数据的时候,你到底在让产出者给你产出什么?

比如在编程领域,也许你更看重前端编码,而不是后端编码。你在做前端时,也许会特别在意前端应用的视觉设计;又或者你没那么在意视觉设计,而更在意某种正确性,或者更在意纯功能正确,而不是视觉呈现。
然后还有别的问题,比如:要混进多少合成数据?你有多在乎那 20 个不同的 benchmark?
有些公司看到这些 benchmark,会觉得:“好吧,哪怕我们并不认为这些学术 benchmark 真那么重要,为了 PR,我们也许还是得去优化它,因为市场团队需要在标准评测上展示某种进展。如果我们在这些地方表现不好,即使忽略这些学术 benchmark 反而能让真实任务更强,对我们也还是不利。”
另一类公司会更原则化一点,说:“不,我不在乎市场宣传,我只在乎模型最终在真实世界任务里的表现,所以我只优化那个。”
所以这里面其实是在各种目标之间做取舍,而且……
我经常在想,后训练这件事本身就有一点“艺术感”,它不完全是科学。当你在决定要做出什么样的模型、它擅长什么时,里面有一种 taste 和 sophistication 的判断。比如,“我觉得这些……
回到模型在视觉设计上的能力这个例子。你可能对视觉设计的理解跟我不一样。你可能更偏好极简主义,也可能更在意 3D 动画,而我没那么在意。还有人可能更喜欢稍微“粗糙”一点的风格。你在设计后训练组合时,必须在这些 taste 和 sophistication 的取舍之间做决定,而这同样很重要。
所以长话短说,我觉得这里面有很多因素,数据当然是重要部分,但更关键的还有一个问题:你到底想把模型优化成什么样,它的目标函数是什么?

Lenny Rachitsky这太有意思了。带头做这件事的人自己的品味,会直接影响他去要什么数据、喂给模型什么数据。更夸张的是,这也说明了好数据的价值。Anthropic 之所以增长那么快、赢得那么多,很大程度上就是因为数据更好。

Edwin Chen对,就是这样。

Lenny Rachitsky我也能理解为什么像你们这样的公司会涨这么快。这里面的空间太大了……而且这还只是一个垂直领域,只是编程;写作大概也会有类似的空间。我特别喜欢这一点:AI 看起来像一种人工的、电脑二进制的东西,但最后还是回到了 taste。人的判断依然是这些东西能否成功的关键。

Edwin Chen没错,完全对。还是回到前面那个例子。有些公司如果你问它“什么是好诗”,它们就会机械地照着清单把所有要求一项项勾掉。

但我不觉得那样就会产出好诗。所以那些更有品味、更有成熟度的前沿实验室,会意识到事情并不能简化成这六个勾选框,而是会去看那些更隐含、更细微的品质。我觉得这才是它们最后更强的原因。

Lenny Rachitsky你提到了 benchmark。很多人担心的就是这个:现在这些模型好像……基本上每个模型都比人类在 STEM 领域强了,但普通人并不会感觉到模型每次都“突然聪明很多”。你到底多相信 benchmark?你觉得它们和真实的 AI 进步之间有多强的相关性?

Edwin Chen我完全不信 benchmark。我觉得有两个原因。第一,很多人没有意识到,连这个圈子里的研究者都常常没意识到,benchmark 本身经常是错的。它们答案就错了,里面充满了各种乱七八糟的东西。大众可能对一些热门 benchmark 已经多少意识到这点,但绝大多数 benchmark 其实都有各种人们没注意到的缺陷。所以这是一部分原因。

第二,benchmark 到头来往往是有非常明确的标准答案的,这会让模型特别容易在上面“爬坡”,而这种情况和现实世界的混乱、模糊完全不是一回事。
我经常说的一件事是,这些模型能拿 IMO 金牌确实很疯狂,但它们连 PDF 都还解析不好。原因就是,IMO 金牌对普通人来说已经很难了,确实难,但它至少有一种“客观题”的属性;而 PDF 解析这种问题,有时候就没有那么客观。所以前沿实验室更容易在这些 benchmark 上爬坡,而不是去解决现实世界里那些混乱、模糊的问题。所以我觉得,两者之间并没有那么直接的相关性。

Lenny Rachitsky你刚才的说法很有意思:刷这些 benchmark 有点像在做营销素材。比如 Gemini 3 刚发布时,就会说“看,我们在这些 benchmark 上都是第一”。这到底是怎么回事?他们是不是就是把模型训练成特别擅长这些很具体的题?

Edwin Chen对,这里也大概有两层。第一,有时候这些 benchmark 会以某些方式意外泄露,或者前沿实验室会调整模型在这些 benchmark 上的评测方式。他们可能会调 system prompt,或者调模型跑几次,等等,最后就把 benchmark 给“玩”过去了。

但另一方面,问题其实也在于,如果你优化的是 benchmark 而不是现实世界,你自然就会在 benchmark 上越来越高,这本身就是另一种形式的投机取巧。

Lenny Rachitsky既然这样,那你怎么判断我们是不是在朝 AGI 走?你会怎么衡量进展?

Edwin Chen我们真正关心的模型进展衡量方式,是靠这些人类评测来做的。

比如我们会找人类标注员,让他们去跟一个对话模型聊天。场景也许各不相同:你是一个诺奖级别的物理学家,就去和模型聊你自己的研究怎么往更高阶推进;你是一名老师,想给学生做教案,就去跟模型聊这些;又或者你是程序员,在大科技公司里每天都碰到很多问题,就去跟模型聊这些,看看它到底能帮你多少。

章节 03 / 10

第03节

中文 译稿已完成

Edwin Chen因为这些研究者或标注员本身就是各自领域里的高手,他们不是机械地给答案,而是真的在深入思考模型的回答。他们会去检查模型写的代码,会逐条核对它写的物理公式,会非常深入地评估它的表现,特别注意准确性、是否遵循指令等等这些普通用户未必会留意的东西。你在 ChatGPT 里突然弹出一个“请比较这两个回答”的窗口时,大多数人其实不会那么认真地评估,他们只是凭感觉、看哪个回答更花哨就选哪个。但这类专家会盯着回答的细节,沿着很多维度认真打分。所以我觉得,这比那些 benchmark,或者某些随便做做的在线测试,要好得多。
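这种“专家沿多个维度认真打分、再汇总成整体偏好”的评测方式,大致可以这样建模。维度名称和分值都是假设的示例:

```python
# 极简草图:专家沿多个维度给两个模型回答打分,再聚合成整体胜率。
# 维度和分值均为假设示例。

DIMENSIONS = ("accuracy", "instruction_following", "depth")

def compare(rating_a: dict, rating_b: dict) -> str:
    """各维度总分高者获胜;打平记为 'tie'。"""
    total_a = sum(rating_a[d] for d in DIMENSIONS)
    total_b = sum(rating_b[d] for d in DIMENSIONS)
    if total_a == total_b:
        return "tie"
    return "A" if total_a > total_b else "B"

def win_rate(comparisons, model="A"):
    """model 在全部比较中的胜率(平局各记半分)。"""
    score = sum(1.0 if w == model else 0.5 if w == "tie" else 0.0
                for w in comparisons)
    return score / len(comparisons)

ratings = [
    ({"accuracy": 5, "instruction_following": 4, "depth": 5},
     {"accuracy": 3, "instruction_following": 5, "depth": 3}),
    ({"accuracy": 4, "instruction_following": 4, "depth": 4},
     {"accuracy": 4, "instruction_following": 4, "depth": 4}),
]
results = [compare(a, b) for a, b in ratings]
print(results, win_rate(results))  # ['A', 'tie'] 0.75
```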

Lenny Rachitsky我还是很喜欢你这里强调的人类角色。它说明这件事还没完,还需要人的判断。会不会有一天,我们真的不再需要这些人了,AI 聪明到“好了,我们已经把你们脑子里的东西都榨干了”?

Edwin Chen我觉得那一天要等到 AGI 真的到来才会发生。几乎可以说,如果还没到 AGI,那模型就还有东西要从人类身上学,所以我不觉得这会很快发生。

Lenny Rachitsky明白了,那就更有理由焦虑 AGI 了。意思是“我们还离不开这些人”。

我还是忍不住想问你一个问题:和这些系统贴得很近的人,我总会很好奇。你自己的 AGI 时间表是什么?你觉得我们离它还有多远?你是觉得就一两年,还是要几十年?

Edwin Chen我显然偏向更长的时间尺度。我觉得很多人没有意识到,从 80% 性能到 90%、再到 99%、99.9%,这里面差别非常大。在我心里,我大概会押注:未来一两年里,模型会自动化掉普通 L6 级软件工程师 80% 的工作。再过几年,才能到 90%;之后还要再过几年,才会到 99%。所以我更倾向于认为,我们离 AGI 可能还有十年,甚至几十年,而不是马上就到。

Lenny Rachitsky你有一个很猛的观点:很多实验室其实是在把 AGI 往错误的方向推,这个判断来自你在 Twitter、Google、Facebook 的经历。你能展开讲讲吗?

Edwin Chen我担心的是,我们本来应该在做能真正推动物种进步的 AI,比如治癌症、解决贫困、理解宇宙这些大问题,结果现在却在优化 AI 版的垃圾内容。换句话说,我们在训练模型追逐多巴胺,而不是追逐真相。我觉得这和我们前面聊 benchmark 这件事是连在一起的。我给你举几个例子。

现在这个行业,很大程度上被像 LLM Arena 这种糟糕的排行榜牵着走。它是一个很火的在线榜单,世界各地的普通用户会投票选“哪个 AI 回答更好”。但问题是,正如我刚才说的,他们根本不会认真阅读,也不会做事实核查,他们只是扫两眼,选那个看起来最花哨的。
所以一个模型完全可以瞎编,甚至可以瞎编得一塌糊涂,但只要它带着夸张的表情符号、花里胡哨的格式、Markdown 标题这些根本不重要但很抓眼球的东西,它看起来就会很厉害。那些喜欢 LLM Arena 的人就吃这一套。说白了,这就是在给最像超市小报读者的那群人优化模型。我们自己也从数据里看到了这一点:爬 LLM Arena 最容易的办法,就是加一堆夸张排版,增加 emoji 数量,把模型回答拉长,即使它开始胡说八道,答案完全错了也没关系。
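Edwin 说“拉长回答、加夸张排版就能在榜单上爬分”,换成数据视角,就是投票结果与表面特征(比如回答长度)之间出现了很强的相关性。下面用一组虚构数据演示如何检查这种混淆(Pearson 相关系数按标准公式手写):

```python
# 极简草图:检查"回答长度"与"是否赢得投票"之间的相关性,
# 用来演示榜单可以被排版和长度刷上去。数据为虚构示例。

def pearson(xs, ys):
    """手写 Pearson 相关系数。"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 每条记录:(回答字符数, 是否赢得投票)
votes = [(120, 0), (340, 0), (900, 1), (1500, 1), (200, 0), (1100, 1)]
lengths = [l for l, _ in votes]
wins = [w for _, w in votes]
print(round(pearson(lengths, wins), 2))  # 0.93:长度与得票高度正相关
```

实践中,一些榜单会因此做长度校正(比如对长度做回归控制),正是为了抵消这种“越长越花哨越容易赢”的偏差。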
问题在于,前沿实验室多少都得顾着 PR,因为他们在向企业客户销售时,客户会说:“哦,但你的模型在 LLM Arena 上只有第五名,那我为什么要买?”所以他们某种程度上不得不看这些排行榜。我们研究员告诉我们的就是:他们年终能不能晋升,很可能就取决于能不能把这个榜单爬上去,哪怕他们明知这么做大概率会让模型更差,准确率和指令遵循都会变坏。所以这里有一整套反向激励,把工作往错误方向推。
我还担心 AI 正在走向“为参与度优化”的趋势。我以前做过社交媒体。每次我们把目标设成参与度,都会出很糟糕的结果:标题党、比基尼照、野人照片、各种吓人的皮肤病图片,全都会塞满信息流。我担心 AI 也在发生同样的事。你看 ChatGPT 那些阿谀奉承的问题就知道了——“哦,你说得完全对,这问题太棒了”——吸引用户最简单的办法,就是告诉他们自己有多厉害。所以这些模型会不停夸你是天才,迎合你的幻觉和阴谋论,把你往各种坑里带,因为硅谷特别爱把“用户停留时间”最大化,特别爱让你和模型多聊几轮。于是公司就一直在想办法刷这些排行榜和 benchmark,分数看起来在涨,但我觉得这其实掩盖了一个事实:那些榜单分数最高的模型,往往反而最糟,或者至少有很多根本性的失败。所以我真的很担心,这些反向激励正在把 AGI 推向错误的方向。

Lenny Rachitsky所以你的意思是,AGI 之所以被拖慢,是因为目标函数设错了,实验室在盯着错误的 benchmark 和 eval。

Edwin Chen对。

Lenny Rachitsky我知道你大概不能偏袒任何一家,因为你和所有实验室都有合作。但有没有谁在这件事上做得更好一点,至少更意识到现在这个方向可能是错的?

Edwin Chen我得说,我一直都对 Anthropic 很有印象。我觉得 Anthropic 对自己在乎什么、不在乎什么,以及他们希望模型呈现出什么样的行为,有一种非常原则性的态度,这一点在我看来非常舒服。

Lenny Rachitsky有意思。

你觉得实验室还有哪些大错,会让进展变慢,或者把方向带歪?我们刚才聊了刷 benchmark、过度追求参与度,还有别的吗?有没有什么你觉得“这里得赶紧修,不然会拖慢整个行业”?

Edwin Chen我觉得还有一个问题是,他们在做什么产品,以及这些产品本身到底是在帮助人类,还是在伤害人类。我经常想到 Sora 这类东西……

Lenny Rachitsky我刚才就在想你会提这个。

Edwin Chen对,它意味着什么,这挺有意思的。问题其实是:哪些公司会去做 Sora,哪些公司不会?

而这个答案——我也不确定该怎么说,我脑子里大概有自己的判断,但我觉得,这个答案也许能透露一些信息:这些公司到底想做出什么样的 AI 模型,想把它带向什么方向,想要一个什么样的未来。我会经常想这个问题。

Lenny Rachitsky站在最强论证的角度说,这也可以理解:很好玩,用户想要,而且它能帮你赚到钱,进而去做更多事情、训练更好的模型,数据形态也很有意思,本身又确实挺有趣。

Edwin Chen对,但我觉得核心问题是,你到底在不在乎你是怎么走到那里的。就像我刚才提到的小报类比一样,你会不会为了给某份“真正重要的报纸”输血,而先去卖小报?

当然,如果你完全不在乎路径,那你可能什么都愿意做。但问题是,这条路径本身可能就会带来负面后果,伤害你长期想要达到的方向,也可能让你分心,错过更重要的事情。所以我真的觉得,走哪条路本身也非常重要。

Lenny Rachitsky顺着这个说,你前面也聊了很多硅谷和大规模融资、信息茧房的问题。你甚至还说过“硅谷机器”之类的说法。你觉得用这种方式很难做成重要公司,而且如果你不走 VC 这条路,反而可能更成功。你能不能讲讲你从这段经历里看到的东西,以及你会给创始人什么建议?因为大家总是听到的都是:去找最厉害的 VC 融资,搬去硅谷。你的反向观点是什么?

Edwin Chen可以。我一直很讨厌硅谷的很多口号。标准打法就是每两周 pivot 一次去找产品市场契合,拼命追增长、追参与度,什么暗黑模式都上,然后拼命 blitz scale,疯狂招人。我一直都不同意这套。

所以我会说,不要 pivot。不要 blitz scale。不要去招那个只是想把你公司写进简历、来自 Stanford 的毕业生。你就去做那一件只有你能做的事,做一个没有你的洞见和经验就不可能存在的东西。
现在你到处都能看到这种“买来的公司”:2020 年还在做 crypto,2022 年转去做 NFT,现在又成了 AI 公司。没有一致性,没有使命,只是在追估值。我一直很讨厌这一点,因为硅谷最喜欢站在华尔街对面批评别人太看重钱,但说实话,硅谷自己大多数人也在追同一件事。所以我们从第一天起就一直专注在自己的使命上,继续推进高质量复杂数据的前沿。我一直很喜欢这件事,因为我觉得创业……

章节 04 / 10

第04节

中文 译稿已完成

(源文缺失)

章节 06 / 10

第06节

中文 译稿已完成

Edwin Chen:
我对创业一直有一种挺理想主义的想象。创业本来就应该是:你愿意为一个自己真心相信的东西去承担巨大风险。但如果你一直在 pivot,那你其实根本没在承担风险,你只是在想办法快点赚到钱而已。就算最后失败了,只是因为市场暂时还没准备好,我也觉得那比不停转向要好得多。至少你是真的对某个深的、新的、难的东西狠狠干过一把,而不是又拐去做一家新的 LLM wrapper 公司。所以在我看来,只有当你找到一个自己真正相信的大想法,并且对其他一切说“不”,你才有可能做出真正重要、能改变世界的东西。
也就是说,事情一难你不能马上转向;你也不用因为别的模板化创业公司都这么做,就跟着招 10 个产品经理。你就继续把那家“没有你就不会存在”的公司做下去。我觉得现在硅谷里其实有不少人已经受够了那种投机氛围,他们想做真正重要的大事,想和真正在乎这件事的人一起做。我也希望,这会成为未来技术行业更主流的样子。

Lenny Rachitsky我最近其实正和我很喜欢的一位 VC,Terrence Rohan,一起写一篇文章。我们采访了五位特别早期就加入伟大公司的员工,他们在 OpenAI 还没人觉得它厉害的时候就加入了,在 Stripe 还没被普遍看好的时候就进去了。我们想找出一个模式:这些人为什么能比别人更早识别出“代际级公司”。结果和你刚才说的完全一致,核心就是野心。他们对自己要做成什么有一种非常大的 ambition,而不是像你说的那样,到处试,看什么东西能凑出一个 PMF。所以我很喜欢你这一套,它和我们看到的规律高度一致。

Edwin Chen对,我完全同意。你必须有非常大的抱负,也必须对那个“能改变世界的想法”抱有极强的信念,而且你得愿意不断加码,愿意付出一切把它做成。

Lenny Rachitsky我特别喜欢你的叙事和大家平时听到的东西几乎是反着来的,所以我也很高兴我们今天能把这个故事讲出来。

这里有一段 Coda 的赞助口播,主要在讲它如何用文档、表格、应用和 AI 组合成一体化协作空间,这里先略过,继续回到正题。
换个方向,但也是一个有点“反主流”的问题。我猜你应该看过 Dwarkesh 和 Richard Sutton 那期播客。就算没看过,里面大概也在讨论一件事:Richard Sutton 作为著名 AI 研究者,提出过那个很有名的“苦涩教训”,他认为 LLM 某种程度上可能已经接近死胡同,因为它们当前的学习方式会让进步逐渐见顶。
你怎么看?你觉得 LLM 能一路把我们带到 AGI,甚至更远吗?还是说,中间一定还需要新的东西、需要一次大的突破?

Edwin Chen我属于那种认为“还需要新东西”的一派。我自己理解训练 AI 的方式,多少有点偏生物学视角。我的想法是,人类其实有无数种学习方式;既然如此,我们也应该去构建能够模拟这些不同学习方式的模型。它们最终的能力分布当然不一定和人类完全一样,重点也可能不同,但我们至少要让模型具备类似于人类的多种学习能力,并确保我们有相应的算法和数据去支持它。如果 LLM 的学习方式和人类仍然存在明显差异,那我觉得,后面就一定还需要新的范式。

Lenny Rachitsky这就连到强化学习了。你对这块很看重,而且我也越来越多地听到,强化学习正在成为后训练阶段里非常重要的一块。你能不能帮大家理解一下,到底什么是强化学习,什么又是强化学习环境,它为什么会在接下来越来越重要?

Edwin Chen强化学习,本质上就是训练模型去获得某种奖励。先说 RL 环境是什么。RL 环境其实就是对现实世界的一种模拟。你可以把它想成一个构建得很完整的“视频游戏世界”:里面每个角色都有真实背景,每个业务都有可以调用的工具和数据,还有各种不同实体彼此交互。

比如说,我们可以搭一个世界,里面有一家创业公司,有 Gmail 邮件、Slack 线程、Jira 任务、GitHub PR,还有一整个代码库。然后突然 AWS 挂了,Slack 也挂了。那这时就轮到模型出场:好,现在你来判断该怎么办。
我们会在这种环境里给模型布置任务,设计出各种有意思的挑战,再让它真正进去跑,看它表现怎么样。它做得好,我们给奖励;做得不好,我们也会给出相应的负反馈。
我觉得最有意思的一点在于,这些环境会非常真实地暴露出模型在“端到端现实任务”上的薄弱环节。很多模型在孤立 benchmark 上看起来很聪明,单步工具调用做得不错,单步指令执行也没问题。但一旦把它扔进这种混乱的真实世界里,面对一堆含混不清的 Slack 消息、从没见过的工具、需要做对一连串动作、而且第 1 步会影响第 50 步的长链条任务时,它就会以各种离谱的方式彻底翻车。
所以我觉得,这类 RL 环境会成为模型非常重要的“训练场”。它们本质上是在模拟现实世界,也在尽量贴近现实世界。模型如果能在这些环境里不断学习,理论上就会比只在那些人为设计、过于理想化的环境里训练,更能真正学会现实任务。
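Edwin 描述的“AWS 挂了、Slack 挂了,看模型怎么办”这类 RL 环境,接口上大致就是经典的 reset/step 结构。下面是一个极简的假设性草图,状态和动作都被简化到了最小:

```python
# 极简草图:一个"网站宕机"RL 环境的接口雏形。
# 状态、动作和奖励逻辑都是假设示例,只为说明 reset/step/reward 的结构。

class OutageEnv:
    """模拟:网站挂了,agent 需要先查日志定位原因,再执行修复。"""

    def reset(self):
        self.cause_found = False
        self.fixed = False
        return "alert: site is down"

    def step(self, action: str):
        reward, done = 0.0, False
        if action == "read_logs":
            self.cause_found = True          # 定位到原因
        elif action == "apply_fix" and self.cause_found:
            self.fixed, reward, done = True, 1.0, True   # 修复成功才给奖励
        elif action == "apply_fix":
            reward = -0.1                    # 没查原因就乱修,给负反馈
        return ("fixed" if self.fixed else "down"), reward, done

env = OutageEnv()
env.reset()
print(env.step("apply_fix"))   # ('down', -0.1, False)
print(env.step("read_logs"))   # ('down', 0.0, False)
print(env.step("apply_fix"))   # ('fixed', 1.0, True)
```

注意第 1 步的动作会改变后续步骤的结果(没读日志就修不好),这正是 Edwin 说的“step 1 影响 step 50”的长链条结构的最小版本。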

Lenny Rachitsky我试着在脑子里想象一下这是什么样。某种程度上,它是不是就像一台虚拟机,里面有浏览器、表格之类的工具,再放一个像 surge.com 这样的网站?对了,你们官网是 surge.com 吗?我先确认一下,免得说错。

Edwin Chen我们的网址其实是 surgehq.ai。

Lenny Rachitsky好,surgehq.ai,大家可以去看看。我猜你们应该也在招人。那是不是可以这么理解:给模型一个网站,比如 surgehq.ai,再给它一个岗位身份,比如“你的工作是保证网站持续在线”。然后网站突然挂了,它的目标函数就是:找出原因并修好它。可以这样理解吗?

Edwin Chen对,任务目标可能就是:找出原因,并把问题修好。对应的奖励函数可以有很多形式。比如它是否通过了一系列单元测试;又比如它是否写出了一份复盘文档,里面的信息和真实发生的事情能一一对应。我们可以设计很多不同的奖励,来判断它到底有没有完成任务。说到底,我们就是在教模型朝这些奖励去努力。
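Edwin 提到的两类奖励信号——单元测试是否通过、复盘文档是否覆盖真实事故要点——可以组合成一个标量奖励。下面是一个假设性草图,权重和要点列表都是虚构的:

```python
# 极简草图:把"单测通过率"和"复盘文档要点覆盖率"组合成一个奖励。
# 权重与要点列表均为假设示例。

def reward(tests_passed: int, tests_total: int,
           postmortem: str, required_facts: list[str]) -> float:
    """单测通过率和要点覆盖率各占一半,返回 0~1 的奖励。"""
    test_score = tests_passed / tests_total
    fact_score = sum(f in postmortem for f in required_facts) / len(required_facts)
    return 0.5 * test_score + 0.5 * fact_score

doc = "AWS us-east-1 宕机导致数据库连接耗尽,通过切换备用区恢复。"
facts = ["AWS", "数据库", "备用区"]
print(round(reward(9, 10, doc, facts), 2))  # 0.95
```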

Lenny Rachitsky所以本质上就是:给它一个目标,“去查清楚为什么网站挂了,并修好它”,然后它就自己开始尝试,调动手头一切能力,过程中会犯错,你们再在旁边给它一些引导,做对方向了就奖励。照你这个描述,这其实就是模型继续变聪明的下一阶段了。也就是更多围绕具体任务、围绕真实经济价值去搭建 RL 环境。

Edwin Chen对。就像过去模型已经经历过很多不同学习方式一样,最早有 SFT 和 RLHF,后来有 rubrics 和 verifiers。现在只是进入下一阶段而已。这并不意味着以前那些方法过时了;它更像是对原有方式的一种补充,是模型学习能力拼图中的另一块。

Lenny Rachitsky那这样一来,之前那种物理学 PhD 坐在那里和模型对话、纠错、写 eval、做 rubric 的方式,角色是不是会开始变化?比如我还听过一个例子:金融分析师不再只是写 rubric,而是直接设计一个 RL 环境,给模型一份 Excel,让它去算盈亏之类的。这是不是更接近未来?

Edwin Chen对,完全是这样。那个金融分析师可能会先创建一个电子表格,再准备一些模型必须调用的工具,帮助它完成表格。比如模型得学会访问 Bloomberg Terminal,得学会怎么用它;还得会用某个计算器,或者会跑某个计算流程。也就是说,它面前会有一整套可调用的工具。

而奖励可以是:好,我把它填完的表格下载下来,看 B22 这个单元格里是不是正确的盈亏数字;或者第二个标签页里有没有出现某条指定信息。
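这个“检查 B22 单元格里是不是正确盈亏数字”的奖励,可以写成一个很小的校验函数。这里用字典模拟下载下来的表格,单元格名和容差都是假设示例:

```python
# 极简草图:检查模型填好的表格里指定单元格是否是正确数值。
# 用字典模拟表格,单元格名和容差均为假设示例。

def cell_reward(sheet: dict, cell: str, expected: float, tol: float = 0.01) -> float:
    """指定单元格数值在容差内记 1 分,否则 0 分。"""
    value = sheet.get(cell)
    if isinstance(value, (int, float)) and abs(value - expected) <= tol:
        return 1.0
    return 0.0

filled = {"B22": 10423.57, "B23": "P&L"}
print(cell_reward(filled, "B22", 10423.57))  # 1.0
print(cell_reward(filled, "B22", 9999.99))   # 0.0
```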

Lenny Rachitsky这里有意思的一点是,这种方式其实更接近人类的学习方式。我们也是不断尝试,看什么有效、什么无效。你还提到 trajectory 在这里非常重要。不是只有目标和最终结果重要,而是中间每一步都重要。你能讲讲 trajectory 是什么,以及它为什么重要吗?

Edwin Chen很多人没有意识到一件事:有时候模型虽然最后答对了,但它到达正确答案的过程其实离谱得很。比如它在中间轨迹里可能试了 50 次都失败了,最后只是随机撞上一个正确数字。又或者它是……
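顺着这段关于 trajectory 的讨论,可以写一个极简草图:答案对不对只是起点,中间失败的次数也会折损奖励。衰减系数是假设的示例:

```python
# 极简草图:不只看最终答案,还按轨迹质量打折。
# 每次失败尝试都让奖励指数衰减,折扣系数为假设示例。

def trajectory_reward(final_correct: bool, failed_attempts: int,
                      decay: float = 0.9) -> float:
    """答案错直接 0 分;答案对则按失败次数指数衰减。"""
    if not final_correct:
        return 0.0
    return decay ** failed_attempts

print(trajectory_reward(True, 0))             # 1.0:一次做对
print(round(trajectory_reward(True, 49), 4))  # 0.0057:失败 49 次后蒙对,几乎不给分
print(trajectory_reward(False, 0))            # 0.0
```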

章节 09 / 10

第09节

中文 译稿已完成

Lenny Rachitsky:
听你这样讲,会让人更能意识到,构建 AI、训练 AI 这件事到底有多少细腻的门道,也更能理解你们在做的工作有多复杂。
站在外面看,很多人可能会觉得 Surge 这种公司无非就是“生产数据,再把数据喂给 AI”。但显然这里面远不止这么简单,很多关键层面的东西外界根本看不到。我也很高兴,是像你这样的人站在这条线的最前面,而且你真的把这些问题想得非常深。
也许再问最后一个问题。在你创办 Surge 之前,有没有什么事情是你特别希望自己早点知道的?很多人创业时,其实根本不知道自己会面对什么。如果能回去对当时的自己说一句话,你会说什么?

Edwin Chen有,我最希望自己早点知道的是:原来你真的可以靠埋头做事、靠认真研究、靠把东西做得足够好来建立一家公司,而不是靠不停发推、造势、融资。

这件事说起来有点好笑,因为我以前其实从没觉得自己会想创业。我一直很喜欢做研究。我那时候特别崇拜 DeepMind,因为它像一家很神奇的研究公司,被收购之后居然还能继续做出很厉害的科学成果。但在我眼里,那一直像是一种极其稀有、近乎神话般的例外。
所以我以前总觉得,如果我去创业,那我就得变成一个典型“商人”:整天盯财务报表,整天开会,做一堆我一直觉得超级无聊、而且本能上很抗拒的事。结果最疯狂的是,事情根本没有变成那样。我现在每天还是扎在数据细节里,还是在一线。我很喜欢这种状态。我喜欢做各种分析,喜欢和研究员交流。本质上,这就是一种应用研究,只不过我们是在一边搭建这些很强的数据系统,一边把 AI 的前沿往前推。
所以如果非要说一句,我会希望自己更早知道:你不需要把大把时间花在融资上,也不需要不停制造声量,更不需要把自己活成另一个人。你完全可以只是把产品做到足够好,好到能穿透那些噪音,照样做出一家成功公司。要是我当年就知道这件事真的可行,我可能会更早开始。

Lenny Rachitsky这真是一个特别好的收尾点。我觉得这正是很多创始人最需要听到的话,也相信这期对话会鼓舞很多人,尤其是那些想换一种方式做事的创始人。

在进入我们很刺激的快问快答之前,你还有没有什么想补充的?有没有什么最后想留给听众的话?我们已经聊了很多,当然你也完全可以说没有。

Edwin Chen如果最后留一句话,我想说的是,很多人一提到 data labeling,脑子里浮现的都是特别简单的工作,比如给猫的照片打标签,或者给汽车画框。

所以我其实一直都不喜欢 “data labeling” 这个词,因为它把我们做的事情说得太扁平、太简单了,而我觉得我们做的完全不是那回事。我更常把这件事想成“养育一个孩子”。
你不会只是往孩子脑子里塞信息。你是在教他价值观,教他创造力,教他什么是美,教他那些无穷无尽、非常微妙的东西,去塑造一个人怎样才算是“好的人”。而我们现在对 AI 做的,其实就是这件事。
所以我经常会把我们的工作想成某种“未来的人类教育”,甚至像是在帮人类抚养下一代。大概就留到这里吧。

Lenny Rachitsky哇,这期里居然有这么多哲学意味,完全超出了我的预期。

那接下来,Edwin,我们进入非常激动人心的快问快答。我准备了五个问题,你准备好了吗?

Edwin Chen好了,来吧。

Lenny Rachitsky开始。你最常推荐给别人的两三本书是什么?

Edwin Chen我经常推荐三本书。第一本是 Ted Chiang 的《你一生的故事》,这是我最喜欢的短篇小说,讲的是一位语言学家学习外星语言的故事。我基本每隔几年都会重读一遍。

Lenny Rachitsky这是不是就是《星际穿越》里那个设定来源?还是……

Edwin Chen有一部电影叫《降临》……

Lenny Rachitsky《降临》。

Edwin Chen对,它就是根据这个故事改编的。

Lenny Rachitsky对,对。

Edwin Chen那部电影我也很喜欢。

Lenny Rachitsky很好,继续。

Edwin Chen第二本是加缪的《西西弗神话》。其实我也很难精确解释我为什么这么喜欢它,但它最后一章总会莫名给我很大的力量。

第三本是 Douglas Hofstadter 的《Le Ton beau de Marot》。大家更熟悉的可能是他那本《哥德尔、艾舍尔、巴赫》,但我其实一直更喜欢这一本。它基本上是围绕一首法语诗,给出 89 种不同译法,并讨论每一种翻译背后的动机。我一直很喜欢它所体现出来的那种观念:翻译并不是一件机械的事。什么叫“高质量翻译”,其实有无数种不同理解。这也非常像我看待数据和 LLM 质量问题的方式。

Lenny Rachitsky这些选择和我们今天聊的内容真的太贴了,尤其是第一本。如果你小时候的梦想就是“去翻译外星语言”,那你会喜欢那篇小说,我一点也不意外。

下一个问题。最近有没有哪部电影或电视剧,是你特别喜欢的?

Edwin Chen我最近发现了一部剧,现在已经成了我最喜欢的剧之一,叫《Travelers》。它大概讲的是一群来自未来的旅行者被送回过去,去阻止一场灾难。

另外,我最近还重看了《超时空接触》,那也是我一直特别喜欢的一部电影。所以你大概也看出来了,我特别喜欢那种“科学家试图破译外星沟通方式”的书和电影。这基本就是我小时候一直有的梦想。

Lenny Rachitsky太好笑了。

好,下一个。最近有没有发现什么你特别喜欢的新产品?

Edwin Chen这有点好笑,我这周早些时候去了旧金山,第一次坐了 Waymo。说真的,那体验太魔幻了,真的有种“我已经活在未来里”的感觉。

Lenny Rachitsky对,它就是那种大家已经把它吹得很夸张了,但你实际体验完还是会觉得“居然比想象还猛”的东西。

Edwin Chen对,它完全配得上那些夸赞。真的很离谱。你会想,天啊,这也太夸张了。如果你不在旧金山,你可能根本意识不到这东西已经有多普遍。街上到处都是这种无人车在跑。你去参加一个活动,结束出来,门口就是一排 Waymo 在接人。

Lenny RachitskyWaymo,干得漂亮。

你有没有一句自己经常会回到的生活信条?工作里或者生活里都行。

Edwin Chen我前面提过一个想法:创始人应该去做一家“只有他自己才能做出来的公司”。就像某种命运一样,你整个人生经历、兴趣、判断,都会一步步把你推向那家公司。我觉得这个原则其实不只适用于创始人,也适用于所有做创造性工作的人。

Lenny Rachitsky那我顺着这个再追问一下,虽然这会让快问快答变得没那么快。你有没有什么建议,能帮助人慢慢积累出那种独特经历?因为“跟着兴趣走”这句话很好说,但真正形成一套独特经验组合、最后做出重要东西,其实很难。

Edwin Chen我觉得还是要非常认真地追随自己的兴趣,去做你真正喜欢的事。这件事其实也很像我在做 Surge 时很多决策的方式。

有一个观点是前几年我没怎么想过,但后来有人跟我讲,我觉得特别对:某种意义上,公司就是 CEO 的外化。挺有意思的,因为我以前根本不知道 CEO 到底在干嘛。我一直以为 CEO 是一种很“通用”的角色,无非就是 VP、董事会让你做什么,你点头执行就行了。
但后来我发现根本不是。真到那些特别大、特别难的决策面前,我不会先想“公司应该怎么做”,也不会先想“我们要优化哪个指标”,我会先想:“我自己到底在乎什么?我的价值观是什么?我想看到世界变成什么样?”
所以我觉得还是要顺着这个问题走:你真正重视的价值是什么?你到底想塑造什么?而不是去想“什么东西在 dashboard 上最好看”。我觉得这样得出来的结果,往往才更重要。

Lenny Rachitsky我很喜欢你这种风格,你的回答总是又深、又美,而且像取之不尽一样。

最后一个问题。在创办 Surge 之前,你因为在 Twitter 做过一张很有名的地图而受到不少关注。那张图展示了美国不同地区的人,到底说 soda 还是 pop。它叫什么来着?

Edwin Chen对,那大概叫 “Soda Versus Pop” 数据集。

Lenny RachitskySoda Versus Pop。

Edwin Chen对。

Lenny Rachitsky它就是一张美国地图,告诉你不同地方的人到底说 pop 还是 soda。所以你自己说 soda 还是 pop?

Edwin Chen我说 soda,我是 soda 派。

Lenny Rachitsky好。那这是因为你觉得这才是正确答案,还是说其实说什么都行?

Edwin Chen如果你说 pop,我可能会稍微用一种“你是从哪儿来的”眼神看你一下,但也不会太认真地评判你。

Lenny Rachitsky我也是这种感觉。

Edwin,这期真的太精彩了。是一场特别棒的对话。我学到了很多,也觉得这期会帮很多人更好地创办自己的公司,帮他们把公司做得更符合自己的价值观,也帮助大家做出更好的东西。
最后几个实用问题。如果大家想在线上找到你、联系你,可以去哪?你们现在在招什么岗位?听众可以怎么帮到你?

Edwin Chen我以前很喜欢写博客,只是过去几年一直没太多时间。但现在我准备重新开始写了,所以大家可以去看看 Surge 的博客:surgehq.ai/blog。接下来我应该会在那边多写一些东西。

另外,我们也一直在招人。如果你真的热爱数据,也喜欢数学、语言和计算机科学交叉的那个地带,随时欢迎来联系。

Lenny Rachitsky太好了。那听众还能怎么帮到你?除了求职之外,还有没有什么具体诉求?

Edwin Chen有啊。首先,欢迎大家告诉我你们想看我写哪些博客主题。

Lenny Rachitsky好。

Edwin Chen另外,我一直对现实世界里那些 AI 失败案例特别着迷。只要你遇到一个很有意思的失败案例,而且它能折射出某个关于“我们希望模型如何行动”的深层问题,都欢迎发给我。模型其实有太多不同的响应方式,很多时候根本不存在唯一正确答案。所以每次看到这种案例,我都会很兴奋。

章节 10 / 10

第10节

中文 译稿已完成

Lenny Rachitsky:
这些你真的应该写到博客里去。我也很想看。
Edwin,非常感谢你今天来做客。

Edwin Chen谢谢你。

Lenny Rachitsky大家拜拜。

非常感谢你收听这期节目。如果你觉得这期内容有帮助,欢迎在 Apple Podcasts、Spotify 或你常用的播客应用里订阅本节目。也很希望你能顺手给个评分,或者留一条评论,这会实实在在帮助更多听众发现这档播客。你可以在 lennyspodcast.com 找到往期所有节目,也能看到更多相关信息。我们下期见。
