AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)

Lenny's Podcast

https://www.youtube.com/watch?v=qbvY0dQgSJ4

Chip Huyen: A question that gets asked a lot is, "How do we keep up to date with the latest AI news?" Why do you need to keep up to date with the latest AI news? If you talk to the users and understand what they want or don't want, and look into the feedback, then you can actually improve the application way, way, way more.

Lenny Rachitsky: A lot of companies are building AI products. A lot of companies are not having a good time building AI products.

Chip Huyen: We are in an idea crisis. Now we have all these really cool tools to do everything from scratch: you can have a new design, they can help you write code, you can have a new website. So in theory, we should see a lot more, but at the same time, people are somehow stuck. They don't know what to build.

Lenny Rachitsky: For all this AI hype, the data is actually showing that most companies try it, it doesn't do a lot, and they stop. What do you think is the gap here?

Chip Huyen: It's really hard to measure productivity. So I ask people to ask their managers, "Would you rather give everyone on the team very expensive coding agent subscriptions, or get an extra headcount?" Almost every time, the managers will say headcount. But if you ask someone at the VP level, or someone who manages a lot of teams, they would say, "I want the AI assistants." Because as a manager, you are still growing, so for you, having one more headcount is big. Whereas as an executive, maybe you have more business metrics that you care about, so you actually think about what actually drives those productivity metrics for you.

Lenny Rachitsky: Today, my guest is Chip Huyen. Unlike a lot of people who share insights into building great AI products and where things are heading, Chip has built multiple successful AI products, platforms, tools. Chip was a core developer on NVIDIA's NeMo platform, an AI researcher at Netflix. She taught machine learning at Stanford. She's also a two-time founder and the author of two of the most popular books in the world of AI, including her most recent book called AI Engineering, which has been the most read book on the O'Reilly platform since its launch.

She's also gotten to work with a lot of enterprises on their AI strategies, and so she gets to see what's actually happening on the ground inside a lot of different companies. In our conversation, Chip explains a lot of the basics like, what exactly does pre-training and post-training look like? What is RAG? What is reinforcement learning? What is RLHF? We also get into everything she's learned about how to build great AI products, including what people think it takes and what it actually takes. We talk about the most common pitfalls that companies run into, where she's seeing the most productivity gains and so much more.

Chip Huyen: Hi, Lenny. I've been a big fan of the podcast for a while, so I'm really excited to be here. Thank you for having me.

Lenny Rachitsky: I want to start with this table/chart that you shared on LinkedIn a while ago that went super viral, and I think it went super viral because it hit a nerve with a lot of people. Let me just read this and we'll show this on YouTube for people that are watching. So it's this very simple table you shared of what people think will improve AI apps and what actually improves AI apps. What people think will improve AI apps: staying up to date with the latest AI news, adopting the newest agentic framework, agonizing about what vector databases to use, constantly evaluating what model is smarter, fine-tuning a model. And then you have what actually improves AI apps: talking to users, building more reliable platforms, preparing better data, optimizing end-to-end workflows, writing better prompts. Why do you think this hit such a nerve with people? If you had to boil it down, what do you think people are missing about building successful AI apps?

Chip Huyen: A question that gets asked a lot is, "How do we keep up to date with the latest AI news?" I'm like, "Why? Why do you need to keep up to date with the latest AI news?" I know it sounds very counter-intuitive, but there's just so much news out there. A lot of people also ask me questions like, "How do I choose between two different technologies?" Recently, maybe, MCP versus the agent-to-agent protocol: "Which one is better?" I think the question you should ask is, "First, how much improvement could you get from the optimal solution versus a non-optimal solution?" And sometimes they're like, "Actually, it's not much."

I'm like, "Okay, if it's not much improvement, then why do you want to spend so much time debating something that doesn't make that much difference to your performance?" Another question to ask is, "If you adopt a new technology, how hard would it be to switch it out for another?" And sometimes they're like, "Oh, I think it could be a lot of work switching it out." And I'm like, "Hmm, so here's a new technology that hasn't been tested by a lot of people, and if you adopt it, you could be stuck with it forever. Do you actually want to adopt it?" Maybe you want to think twice about overcommitting to new technologies that haven't been well tested.

Lenny Rachitsky: I love that your broader advice is so simple: to build successful AI apps, talk to users, prepare better data, write better prompts, optimize the user experience, versus chasing the latest and greatest: What's the best model to use right now? What's happening in AI? Let me follow this thread of fine-tuning and, basically, post-training. There are all these terms people hear in AI, and I think this is going to be a really good opportunity for people to learn what we're actually talking about, since you actually do these things, you build these things, you work with companies doing these things. There are a few terms I want to sprinkle in through the conversation, but let's start with this one. What's the simplest way for someone to understand the difference between pre-training and post-training, and then how fine-tuning fits into that? What is fine-tuning, actually?

Chip Huyen: Quick disclaimer: I don't have full visibility into what these big, secretive frontier labs are doing. But from what I've heard, one part is supervised fine-tuning, where you have demonstration data: you have a bunch of experts saying, "Okay, here's a prompt, and here is what the answer should look like." You train the model to emulate what the human expert would do. That's also what a lot of open-source models are doing via distillation. Instead of having human experts write really great answers to prompts, they get very popular, famously good models to generate the responses, and then train smaller models to emulate them.

I really appreciate the open-source community, by the way, but being able to train a model that emulates an existing good model is very different from being able to train a good model in the first place. There's a big step there. So we have supervised fine-tuning, and another thing that's very big right now, I'm not sure if you've had guests talking about it already, is reinforcement learning. It is everywhere.

Lenny Rachitsky: Let's pause on that, because I definitely want to spend time on it; it's such a cool topic that's emerging more and more in my conversations. But just to summarize what you shared, which I think is really, really important stuff: the idea here is that a model is essentially this algorithm, this piece of code that someone writes, and say the frontier labs are feeding it just about the entire internet of content, and basically, it's testing itself on predicting, across all that data, the next word. Token is the correct way of thinking about it, but a simpler way to think about it is the next word in the text. As it gets it wrong, it adjusts these things called weights. Is that a simple way to think about it, even at a very surface level?

Chip Huyen: I think of language modeling as a way of encoding statistical information about language. Let's say that we both speak English, so we have a sense of what is more statistically likely. If I say, "My favorite color is...", then you would expect the next word to be a color. The word "blue" would be much more likely to appear than some unrelated word, because statistically, "blue" is more likely to follow "my favorite color is." So it's a way of encoding statistical information.
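The "my favorite color is" intuition above can be sketched as a toy next-word model. This is purely illustrative: the mini-corpus and function names are made up for this example, not anything from a real training pipeline.

```python
from collections import Counter, defaultdict

# Count which word follows which in a tiny made-up corpus, then turn the
# counts into next-word probabilities -- the "statistical information" idea.
corpus = (
    "my favorite color is blue . my favorite color is green . "
    "my favorite color is blue . the sky is blue ."
).split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(word):
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# After "is", "blue" is the most statistically likely continuation here.
print(next_word_probs("is"))
```

Real models do the same thing at vastly larger scale, over tokens rather than whole words, with a neural network instead of a lookup table.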

So with language modeling, when you train on a large amount of data, the model sees a lot of language across a lot of domains, so it can tell what's likely. Then the user writes a prompt, and it comes up with the next most likely token. By the way, it's not a new idea. The idea is very, very old; it comes from a 1951 paper on the entropy of printed English, by Claude Shannon. It's a great paper. And there's a story I really like... Did you read Sherlock Holmes, by the way?

Lenny Rachitsky: Yeah, I read a few Sherlock Holmes books. Yeah.

Chip Huyen: Yeah. There's a story where Sherlock Holmes uses this statistical information to help solve a case. Somebody left a message written in stick figures. Sherlock Holmes knew that in English, the most common letter is E, so the most common stick figure must stand for E, and he decoded the message from there. So in a way, it's simple language modeling, but instead of at the word level, he does it at the character level, and a token is something in between. A token is not quite a word, but it's bigger than a character. We use tokens because they help us reduce the vocabulary: characters give the smallest vocabulary, the alphabet has only 26 letters, but there can be millions and millions of words, whereas with tokens you can get a sweet spot between the two.

Let's say we have a new word, say "podcasting." Even if it's new, it can be divided into "podcast" and "ing." We know the meaning of "podcast," and we know that "ing" makes a gerund, so we can understand the word "podcasting." That's where tokens come in. But yeah, pre-training is basically encoding statistical information about language to help you predict what is most likely. And "most likely" is the simplest way of putting it, because it's really building a distribution: okay, 90% of the time the next token could be a color, 10% of the time it could be something else. So it's a distribution the language model picks from, depending on your sampling strategy. Do you want it to always pick the most likely token, or do you want it to pick something more creative? I think sampling strategy is extremely important. It can boost performance in a huge way, and it's very, very underrated.
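The sampling-strategy choice just described, always take the most likely token versus sample more creatively, can be sketched like this. The token probabilities are invented for illustration, and this is a generic temperature-sampling sketch, not any particular model's decoder.

```python
import math
import random

# An invented next-token distribution, like "90% a color, 10% something else".
probs = {"blue": 0.90, "green": 0.07, "banana": 0.03}

def greedy(probs):
    # Always pick the single most likely token.
    return max(probs, key=probs.get)

def sample_with_temperature(probs, temperature=1.0):
    # Lower temperature -> sharper, more deterministic;
    # higher temperature -> flatter, more "creative".
    logits = {t: math.log(p) / temperature for t, p in probs.items()}
    z = sum(math.exp(v) for v in logits.values())
    reshaped = {t: math.exp(v) / z for t, v in logits.items()}
    r, cum = random.random(), 0.0
    for token, p in reshaped.items():
        cum += p
        if r < cum:
            return token
    return token  # guard against floating-point rounding

print(greedy(probs))  # always "blue"
```

At temperature near zero this behaves like greedy decoding; at higher temperatures the less likely tokens get picked more often.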

Lenny Rachitsky: Okay, awesome. So essentially, a model is just code with this whole set of weights: a statistical model that has learned to predict what comes next after certain words and phrases?

Chip Huyen: Yeah.

Lenny Rachitsky: And then post-training, and fine-tuning specifically, is doing that same thing. So pre-training gets you GPT-5. Fine-tuning is someone taking GPT-5 and doing the same sort of thing: adjusting these weights a little bit for a specific use case, on data that they find is necessary for that very specific use case. Is that a simple way to think about it?

Chip Huyen: Yeah. You can think of the weights as part of a function. Let's say you have a function for Lenny's height: maybe it's 1 times x plus something, or 2 times x plus something, and that 1 or 2 is a weight. You change it until the function fits the correct data, which is my height and your height. So training adjusts the weights so the function fits the data, which is the training data.
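The height-function analogy, nudge a weight until the function fits the data, can be sketched with plain gradient descent. The data points and the "true" relationship y = 2x + 1 are made up for illustration.

```python
# Fit y = w*x + b by repeatedly adjusting the weights w and b to reduce
# the squared error on the (made-up) training data.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # follows y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # Gradients of mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # converges toward 2.0 and 1.0
```

Pre-training and fine-tuning do conceptually the same weight-nudging, just over billions of weights and far messier data.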

Lenny Rachitsky: Awesome. Okay. So we're talking about pre-training, post-training, fine-tuning. Is there anything else here that's important to share about just what this is exactly? What people need to understand about these parts of training?

Chip Huyen: The vast majority of the time, we don't touch the pre-trained model. As users, we don't use it directly at all.

Lenny Rachitsky: Right. It's already done for us.

Chip Huyen: Yeah. A fun exercise my friends who train models do is play with the raw pre-trained model, and the outputs are horrendous; it's like, "Oh, my gosh." It's crazy. So it's very interesting to see how much post-training can change the model's behavior, and I think that's where a lot of people at the frontier labs are spending their energy nowadays: on post-training. Pre-training has been used to increase the general capabilities of a model, and that needs a lot of data and model size. At some point, we have kind of maxed out on internet text data. A lot of people are working with other data, like audio and video, and everyone's trying to think of what the new source of data is. But the upshot is that everyone has very similar pre-training data, so post-training is where they make a big difference nowadays.

Lenny Rachitsky: This is a good segue. You talked about supervised learning versus unsupervised learning; I love that we're getting into this, by the way. This is super interesting. So you talk about labeled data. Basically, supervised learning is AI learning on data that somebody has already labeled, telling it, here's correct versus incorrect. For example, this is spam versus not spam. This is a good short story; this is not a good short story. We've had the CEOs of a lot of the companies that do this for labs: Mercor, Scale, Handshake, there's Micro, there's a few others. So is that essentially what these companies are doing for labs: giving them labeled data, high-quality data to train on?

Chip Huyen: It is, in a way, but I think it's only part of a bigger equation. There are a lot more components than that. That's why I was talking about reinforcement learning; I'm not sure if your CEO interviews brought up that term. The idea is that you have a model, you give the model a prompt, and it produces an output. You want to reinforce, to encourage, the model to produce outputs that are better. So now it comes down to: how do we know whether an answer is good or bad? Usually, people rely on signals. One way to judge good or bad is human feedback. You show humans two responses, and they say, okay, this one is better than the other. We do that because, as humans, it's very hard for us to give a concrete score, but it's easier to do comparisons.

If you ask me to give a song a score, I'm not a musician and don't know how hard it is. Out of 10, maybe I'd give it a six, and if you ask me again a month from now, when I've completely forgotten, maybe now it's a seven, or a four, I don't know. But if you ask me, "Here are two songs; which one would you prefer to play at the birthday party?" I can say, "Okay, I prefer this song." Comparisons are a lot easier. So you collect human feedback, and then you use this human feedback to train a reward model. The reward model can score a response the model produces: is this good or bad? And you bias the model toward producing the better responses.
Another way, instead of using a human, is to use AI to judge the response and say good or bad. And then the thing people are very big on nowadays is verifiable rewards, which is natural. Basically, you give the model a math problem, and it outputs a solution. The expected response should be 42, and if it doesn't produce 42, then it's wrong; it's not a good response. So yes, a lot of the time, people use human laborers, human experts, to produce questions and expected answers, in ways that are verifiable, so that models can be trained on them. Yeah.
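The two kinds of signals described here, pairwise human preferences and verifiable rewards, can be sketched like this. The answer-extraction rule and the preference-record format are illustrative assumptions, not any lab's actual pipeline.

```python
def verifiable_reward(model_output: str, expected: str) -> float:
    """Reward 1.0 if the last number in the output matches the expected answer."""
    tokens = [t for t in model_output.replace(".", " ").split() if t.lstrip("-").isdigit()]
    return 1.0 if tokens and tokens[-1] == expected else 0.0

print(verifiable_reward("The answer is 42", "42"))    # 1.0
print(verifiable_reward("I believe it is 41", "42"))  # 0.0

# Pairwise human feedback: raters compare two responses instead of scoring one.
# Records shaped like this are what a reward model would be trained on.
preferences = [
    {"prompt": "Pick a song for the birthday party",
     "chosen": "song_a", "rejected": "song_b"},
]
```

The verifiable check needs no rater at all, which is why it scales so well for math and code; the preference records are where the "comparisons are easier than scores" observation shows up in the data format.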

Lenny Rachitsky: Okay, I'm really glad you went there. This is essentially RLHF, reinforcement learning from human feedback, which is exactly what I wanted to talk about, right?

Chip Huyen: Yeah. In general, it's a way of learning. Training is learning, and whether it learns from human feedback, AI feedback, or verifiable rewards, I'd say those are just different ways of collecting signals.

Lenny Rachitsky: Awesome. Yeah. We had the CEO of Anthropic on the podcast, and he talked about their version of RLHF, which is AI-driven reinforcement learning. I love the way you phrased it: basically, you want to reinforce correct behavior and correct answers in the model, and this is the method to do it, whether it's, say, an engineer seeing an output from a model and saying, "No, here's how I would code it differently," or training a different model that the original model works with to tell it, am I correct or not correct? Is that right, roughly?

Chip Huyen: Yeah.

Lenny Rachitsky: Okay.

Chip Huyen: I think that's one way of looking at it. This space is so exciting nowadays because there are so many domain-expert tasks that model developers want models to do well on. Let's say you're an accountant. Maybe you want to use a model for accounting tasks, so you need a lot of accounting data, examples from accountants, so you need to hire a lot of them. The same goes for physics problems, legal questions, engineering questions. Somebody was telling me they want to use coding to solve scientific problems, not just coding to build products, which is a whole different realm of things. And these tasks often involve very specific tooling. I'm not sure what apps you use, but maybe something like QuickBooks or Google Sheets: very specific tools, specific expertise that you want the model to learn.

So they need a lot of human experts in these areas to create data to train on, and it's a massive thing, because everyone wants a lot of data and nobody has an unlimited budget. I think there are also some low-key interesting economics here. I'm not sure if you've talked to guests about it, but I find it very interesting to think about, because it's very lopsided: there are only a very small number of frontier labs, and they want a lot of data, and there's a massive number of startups and companies providing labeled data. So you see these data-labeling startups with maybe some massive ARR, but if you ask them, "Okay, how many customers do you have?" it could be a very small number. I'm not sure... I saw you smiling.

Lenny Rachitsky: Yeah, yeah, yeah, we chatted about that.

Chip Huyen: Yeah, so I'm a little bit uneasy. You have a company growing like crazy, but it's heavily dependent on two or three customers. And at the same time, if I were one of these frontier labs, what would be the economically right thing for me to do? I'd want a lot of startups, a lot of providers, so I can pick and choose, and so these providers compete with each other to lower the price, and I'm not so dependent on any one of them. So this whole economics is very interesting to me, and I'm curious to see how it plays out.

Lenny Rachitsky: What I'm hearing is you're bearish on the future of these data-labeling companies, because as you said, they don't have a lot of leverage over pricing: they have so few customers, and there are so many people getting into the space. So basically, even though these are some of the fastest-growing companies in the world, you feel like there's a challenge up ahead.

Chip Huyen: I'm not sure if I'm bearish on it. I think I'm curious, because things have a way of working out in ways I don't expect. Maybe these companies, since they have a lot of data, will be able to use it to get insights that help them stay ahead of the curve. So I don't know.

Lenny Rachitsky: A very fair answer. Okay, while we're on this topic, I want to chat about evals, which is a very recurring topic on this podcast. This is the other piece of data these companies provide that AI labs really need. Can you talk about what an eval is, the simplest way to understand it, and then how evals help models get smarter?

Chip Huyen: I think people approach eval as two very different problems. One is as an app builder: let's say I have an app, maybe a chatbot, the first thing that comes to mind, and I want to know if the chatbot is good or bad. So I need to come up with a way to evaluate the chatbot. The other I think of as task-specific eval design. Let's say I'm a model developer and I want to make my model better at code writing. Then it's like, "Okay, but how do I even measure code writing?"

So I need someone who understands code writing to think about what makes it good, and then design the whole dataset and the criteria to evaluate code writing. So there's that. I think eval design is very interesting work: criteria, guidelines, how to do it, and then also training people to do it effectively. I think eval is really, really fun, because it's extremely creative. I was looking at different evals people have built, and it was like, "Wow." It's not dry at all. It's just super, super, super fun.

Lenny Rachitsky: We had a whole podcast on evals with Hamel and Shreya. That's exactly what they talked about: it's actually really fun to create evals, for companies especially. So let's dig into that a little more. There's this debate online; I don't know how big of a deal it is, but it feels like people spend a lot of time thinking about it: do we need evals for AI products? Some of the best companies say they don't really do evals, they just go on vibes. They're just like, "Is this working well? Can I feel it or not?" What's your take on the importance of building evals, and the skill of evals, for AI apps, not the model companies?

Chip Huyen: You don't have to be absolutely perfect to win, I think. You just need to be good enough and be consistent about it. Okay, this is not a philosophy I follow, but I have worked with enough companies to see it play out. So when I ask why a company doesn't do evals: let's say you are an executive and you want a new use case. You start out, build it, and it works well. The customers are somewhat happy. You don't have an exact metric for it.

So the traffic keeps increasing, people seem happy, people keep buying stuff, and now here's an engineer coming in saying, "Okay, we need evals for this." And it's like, "Okay, how much effort do we need to put into evals?" "Maybe two engineers, this much time." "Okay, so how much expected gain can I get from it?" And the engineer says, "Oh, maybe we can improve it from 80% to 82%, maybe 85%."
And it's like, "Okay, but with those two engineers, we could launch a new feature that would give me so much more improvement." So that's one side of eval. Sometimes people think, okay, this is good enough, just don't touch it; spending a lot of energy on eval would only buy an incremental improvement, whereas spending the energy on another use case might pay off more, and the current one is good enough to vibe-check.
I do think that's what the debate is about. A lot of the time, people just get things to a place where it's good enough and move on. But of course, there's a lot of risk associated with that, because if you don't have a clear metric, you don't have good visibility into how your applications or models are performing. It might do something very dumb; something crazy can happen. So I do think eval is very, very important if you operate at a scale where failures can have catastrophic consequences.
Then you do need to be very rigorous about what you put in front of users, understand the different failure modes and what could go wrong. The same goes for a space where the feature, the product, is a competitive advantage: you want to be the best at it, so you want a very strong understanding of where you are relative to your competitors. But if it's something more low-key, not the core, just something that helps your users,
then maybe you don't need to be so obsessed or theoretical about it. It's like, okay, that's good enough for now, and if it fails, then it fails. I know that sounds terrifying. But I think it's all a question of return on investment. I'm a big fan of eval; I love reading evals. And still, I understand why some people would choose not to focus on eval right away and choose to bring on new functionality instead.

Lenny Rachitsky: Awesome. That is a really pragmatic answer. What I'm hearing is evals are great, very important, especially if you're operating at scale, but pick your battles. You don't need to write evals for every little feature. Something that Hamel and Shreya shared is that people need just, I don't know, five or seven evals for the most important elements of their product. Is that what you see, or do you see people building and needing a lot more in production?

Chip Huyen: I don't think there's just a fixed number of evals. What is the goal of eval? The goal of eval is to guide product development. I'm a big fan of eval because it helps you uncover opportunities where you can make progress. Sometimes it's very obvious: you look at the eval and realize, okay, it performs really poorly on this specific segment of users, and then you look into it, and it turns out we just don't have good messaging for them. So people can focus on the things that are doing poorly and improve significantly. So the number of evals really depends. We have seen products with hundreds of different metrics.

Lenny Rachitsky: Oh, wow.

Chip Huyen: It's not that people are going crazy; it's that the product is general. You might have one eval for, I don't know, verbosity, one eval for user-sensitive data, and another for length. Let me give a good, concrete example: deep research. You have an application that uses a model to do deep research for you. You have a prompt; say, "Do comprehensive research on Lenny's Podcast and propose a report on what kinds of topics he's interested in, what kinds of videos get the most views, and what topics he's missing that he should be covering." Given that prompt, how do you evaluate the result? I don't think there's one metric that would help. I think somebody has a benchmark where they get a hundred experts to write a bunch of prompts and then go through the AI's answers, and it's extremely costly and slow.

But you can do something else. One way I was thinking about it, talking to a friend, is: how would you produce the resulting summary? First, you gather information, and to gather information, you need to run a lot of search queries. You grab the search results, aggregate some of them, maybe realize, "Okay, I'm still missing this," go down another route, and at the end you have the summary. Every step of the way, you need evaluations, not just end-to-end. Take the search queries as the first example: say the model writes five search queries. I might look into how good they are. Are they too similar to each other? You don't want five search queries that are basically the same: "Lenny Podcast," "Lenny Podcast last month," "Lenny Podcast two months ago."
That's not very exciting; it's better if the keywords are more diverse. Then look at the results of each search query. Say you enter "Lenny Podcast data labeling" and it comes up with 10 pages, 10 results, and then "Lenny Podcast frontier labs" comes up with 10 results, different web pages. How much do they overlap? Are we getting breadth, a lot of pages, but also depth? And do we have relevance? Because a search query that's completely irrelevant to the original prompt is no good. So every aspect of it needs a way of evaluating. So I don't think the question is "How many evals should I have?" but "How many evals do I need to get good coverage, high confidence in my application's performance, and to help me understand where it's not performing well so I can fix it?"
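Two of the per-step checks sketched here, query diversity and overlap between result sets, could be scored with simple set math. The metric definitions, queries, and page names are illustrative assumptions, not an established benchmark.

```python
def jaccard(a: set, b: set) -> float:
    # Overlap between two sets: 0.0 = disjoint, 1.0 = identical.
    return len(a & b) / len(a | b) if a | b else 0.0

def query_diversity(queries: list[str]) -> float:
    # 1 minus the average pairwise word overlap: higher = more diverse queries.
    sets = [set(q.lower().split()) for q in queries]
    pairs = [(i, j) for i in range(len(sets)) for j in range(i + 1, len(sets))]
    if not pairs:
        return 1.0
    return 1.0 - sum(jaccard(sets[i], sets[j]) for i, j in pairs) / len(pairs)

dull = ["lenny podcast", "lenny podcast last month", "lenny podcast two months ago"]
varied = ["lenny podcast data labeling", "frontier lab economics", "rlhf explained"]
print(query_diversity(dull) < query_diversity(varied))  # True: varied wins

# Result-set overlap between two queries, using the same jaccard score.
results_a = {"page1", "page2", "page3"}
results_b = {"page3", "page4", "page5"}
print(jaccard(results_a, results_b))  # 0.2
```

A real harness would add a relevance check against the original prompt, but the shape is the same: one small scorer per step rather than one end-to-end grade.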

Lenny Rachitsky: Awesome. And I'm also hearing that the very core use case, the most common path people take in your product, is where you want to focus.

Chip Huyen: Yeah, yeah.

Lenny Rachitsky: Okay. There's one more term I want to cover, and then I want to go in a somewhat different direction. RAG? People see this term a lot: R-A-G. What does it mean?

Chip Huyen: RAG stands for retrieval-augmented generation. The idea is that for a lot of questions, we need context to answer. I think it came from a paper around 2017. For a bunch of question-answering benchmarks, people realized that if we give the model information about the question, the answer gets much, much better. So what they did was retrieve information from Wikipedia: for a given question, retrieve the relevant article, put it into the context, and answer. It does much better. It sounds like a no-brainer, obviously. So that's what RAG is, in the simplest sense: providing the model with relevant context so that it can answer the question. And this is where things get more interesting, because traditionally, when it started out, RAG was mostly text.
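The loop just described, retrieve relevant context and put it into the prompt, can be sketched like this. The keyword-overlap "retriever," the documents, and the prompt template are illustrative stand-ins; real systems typically use embedding search.

```python
docs = [
    "RAG stands for retrieval-augmented generation.",
    "Claude Shannon wrote about the entropy of printed English in 1951.",
]

def retrieve(question: str, docs: list[str]) -> str:
    # Pick the document sharing the most words with the question.
    q = {w.strip("?.,").lower() for w in question.split()}
    return max(docs, key=lambda d: len(q & {w.strip("?.,").lower() for w in d.split()}))

def build_prompt(question: str) -> str:
    # Put the retrieved context into the prompt, then ask the model to answer.
    context = retrieve(question, docs)
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What does RAG stand for?"))
```

Everything that follows about data preparation, chunking, metadata, hypothetical questions, is about making that `retrieve` step find the right passage.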

So we talk about a lot of way of how to prepare data so that the model can retrieve effectively. Let's say that not everything is a Wikipedia page. A Wikipedia page is pretty contained and you know, okay, everything about it is about a topic. But a lot of time, you have documents of like and they have a weird way of structures of documents. Let's say that you had documents about Lenny Podcast and in the future, in the beginning a document it's like, from now on, podcast wouldn't refer to Lenny's Podcast. So let's say somebody in the future is like, "Okay, tell me about Lenny. Lenny's work." And because as a document does not have the term Lenny, you just don't know, you might not retrieve it. And if the document is long enough that it's chunked into a different part, so the second part doesn't have the word Lenny, so you cannot reach it. So you have to find a way to process data. So that makes sure it's like... It can retrieve, the information is just relevant to the query even though it might not immediately obvious that it's related.
So people came up with things like contextual retrieval: given each chunk of data, attach relevant context, maybe a summary or metadata, so the retriever knows what it's about. Some people use hypothetical questions. It's very interesting: for each chunk of a document, you generate a bunch of questions that the chunk can help answer, so that when a query comes in, you check whether it matches any of the hypothetical questions and fetch the chunk if it does. It's a very interesting approach. Okay, so maybe before I go to the next thing, I just want to say that data preparation for RAG is extremely important. In a lot of the companies I have seen, the biggest performance gains in their RAG solutions come from better data preparation, not from agonizing over which database to use. The database of course matters; you care about things like latency, or whether you have very specific access patterns like read-heavy or write-heavy workloads. But in terms of pure answer quality, data preparation is what matters most.
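The hypothetical-questions idea above can be sketched in a few lines of Python. This is a minimal illustration, not a real system: the generated questions are hand-written stand-ins for an LLM call, and simple word overlap stands in for embedding search.

```python
# Sketch of "hypothetical questions" indexing: each chunk is indexed by
# questions it can answer, so a query can reach a chunk even when the
# chunk's own text never mentions the query's key terms.

def overlap_score(query: str, question: str) -> float:
    """Fraction of query words that also appear in a candidate question."""
    q = set(query.lower().split())
    return len(q & set(question.lower().split())) / max(len(q), 1)

def build_index(chunks: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Map each hypothetical question back to the chunk it came from."""
    return [(question, chunk)
            for chunk, questions in chunks.items()
            for question in questions]

def retrieve(query: str, index: list[tuple[str, str]]) -> str:
    """Return the chunk whose hypothetical question best matches the query."""
    _, best_chunk = max(index, key=lambda qc: overlap_score(query, qc[0]))
    return best_chunk

# The first chunk never says "Lenny", but a generated question does, so the
# query can still reach it -- exactly the failure mode described above.
chunks = {
    "From now on, 'the podcast' refers to this show. Episodes cover product.":
        ["What is Lenny's Podcast about?", "What topics does the podcast cover?"],
    "Pricing page: plans start at $10/month.":
        ["How much does a subscription cost?"],
}
index = build_index(chunks)
print(retrieve("tell me about Lenny's work", index))
```

In a real pipeline, an LLM would generate the questions offline and an embedding model would score query-question similarity instead of word overlap.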

Lenny RachitskyWhen you say data preparation, what's an example to make that real and concrete for us to understand?

Chip HuyenOne way is what I mentioned: you have chunks of data, and you have to think about how big each chunk should be. Think of it as a context budget you want to make the most of. A very simple example: say you can retrieve a thousand words. If a chunk is long, it's more likely to contain the relevant information, but if a chunk is itself a thousand words, you can only retrieve one chunk, which is not very useful. If chunks are too short, you can retrieve a wider range of documents and chunks, but each chunk is too small to contain the relevant information on its own.
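The chunk-size trade-off can be made concrete with a toy chunker. This is a sketch under simplified assumptions: chunks are fixed-size word windows, and the retrieval budget is a flat word count.

```python
# Minimal fixed-size chunker illustrating the trade-off above: with a fixed
# retrieval budget, bigger chunks mean fewer distinct chunks fit in context.

def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `chunk_size` words, optionally overlapping
    by `overlap` words so sentences aren't cut off mid-thought."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

budget = 1000  # words of retrieved context the prompt can hold
for size in (1000, 250, 50):
    print(f"chunk size {size:4d} -> at most {budget // size} chunks in the budget")
```

Real systems usually chunk on semantic boundaries (headings, paragraphs) rather than raw word counts, but the budget arithmetic is the same.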

So you need good chunk design, deciding how big each chunk should be, and you add contextual information like summaries, metadata, or hypothetical questions. Somebody was telling me that a very big performance gain they got came from rewriting their data in question-and-answer format. So they have a podcast, and instead of just chunking the podcast, they rewrite it as here's a question, here's the answer, and produce a lot of those pairs. You can use AI for that as well. That's one example of data processing. Another example I see a lot is people using AI to help with documentation. The way we write documentation today is usually for human readers, and AI reading is different, because humans have common sense and context. Even human experts bring context that AI doesn't quite have.
So somebody told me that a big change for them was this: say you have a function, and the library documentation says the output is, I don't know, some temperature-like score on a graph that should be one, zero, or minus one. A human expert understands the scale and what a value of one means, but the AI really doesn't understand what that means. So they actually added another annotation layer for the AI: okay, a temperature equal to one means this; it's not an actual temperature, it's a value on this particular scale. It's all this data processing that makes it easier for the AI to retrieve the relevant information to answer the question.
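The annotation-layer idea might look something like this sketch, where each documentation entry keeps the human-facing text plus an AI-facing note. The `graph.temperature` entry and its wording are made up for illustration.

```python
# Sketch of an "annotation layer" for docs: keep the original human-facing
# text, and attach an explicit note spelling out the implicit context an
# expert reader would bring. Retrieval then serves both together.

docs = {
    "graph.temperature": {
        "human": "Temperature of the node, in [-1, 1].",
        "ai_note": (
            "This is not a physical temperature. It is a normalized score on "
            "a -1..1 scale: 1 = fully active, 0 = neutral, -1 = fully inactive."
        ),
    },
}

def doc_for_model(name: str) -> str:
    """Render a doc entry the way it would be placed into the model's context."""
    entry = docs[name]
    return f"{entry['human']}\nNote for AI readers: {entry['ai_note']}"

print(doc_for_model("graph.temperature"))
```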
Lenny RachitskyAwesome. Okay. You've talked a bit about how you work with companies on these sorts of things: their AI strategies, their AI products, how they build, which tools they use. I want to spend a little time here, because a lot of companies are building AI products, and a lot of companies are not having a good time building AI products. Let me ask a few questions about what you've learned working with companies that are doing this well. One is about AI tool adoption within companies in general. There's all this talk recently about AI hype, and the data is actually showing that most companies try it, it doesn't do a lot, and they stop. So there's this sense that maybe this isn't going anywhere. In terms of adoption of AI tools within companies, what are you seeing?

Chip HuyenFor GenAI in companies, I've seen two types of GenAI tooling. One is for internal productivity: coding tools, Slack chatbots, internal knowledge tools. A lot of big enterprises have some wrapper around models with access to some type of RAG solution. We've talked about text-based RAG; we haven't talked about agentic RAG or multi-modal RAG yet, but yes, there's a whole very exciting area around that. Basically, it allows employees to access internal documents. Somebody asks, okay, I'm having a baby, what is the maternity or paternity policy? Or, I'm having this operation, does the health benefit cover it? Or, I want to refer a friend for an interview, what's the process for that? So a lot of this is internal chatbots to help with internal operations.

The other category is more customer-facing or partner-facing. Customer-support chatbots are a big one. If you're a hotel chain, you might have a booking chatbot, and booking chatbots are somehow massive. I have this theory that companies pursue the applications where they can measure a concrete outcome. For booking or a sales chatbot, it's very clear: there's the conversion rate you get with human operators today, and the conversion rate you could get with a chatbot. The outcome is very clear, so companies find it easier to buy into these solutions. That's why a lot of companies have these customer-facing chatbots.
So that's the other category of tool. For customer-facing or external-facing tools, because people are drawn to applications with clear outcomes, the question of adoption really comes down to whether they see the outcome or not. Of course, it's not perfect, because sometimes the outcome is bad not because the application idea is bad, but because the process of building it wasn't great. So it's tricky. For internal adoption of productivity tooling, that's where it gets tricky. A lot of companies think about AI strategy, and I think AI strategy usually has two key aspects: one is use cases and the second is talent. You might have great data for great use cases, but if you don't have the talent, you cannot do it.
So a lot of the time at the beginning with GenAI, and I really admire a lot of companies for this, they said, okay, we need our employees to be very GenAI-aware, very AI-literate. So they start adopting a bunch of tools for the teams to use, they run a lot of upskilling workshops, they encourage learning, and that's a really, really good thing. They're also willing to spend a lot of money on adoption, purchasing subscriptions to get employees to be more AI-literate. And then the thing is, many of them say, okay, we spent a ton of money on this tooling, but when we look at the usage, people don't seem to use it that much, and what is the issue? So yeah, I think that is tricky.

Lenny RachitskyWhat do you think is the issue? Is it just they don't know how to use them? What do you think is the gap here? Do you think we'll get to a place of just like, wow, work is completely different because of AI for a lot of companies?

Chip HuyenThe main thing, again, is that it's really hard to measure productivity. I talk to a lot of people about this. First of all, there's coding. A lot of companies are now using coding agents or coding assistants. And I ask, "Do you think it helps with your productivity?" A lot of the time the answers are vague: okay, I feel like it's better, we have more PRs, we see more code. But of course, number of lines of code is not a good metric for that. So it's really, really tricky, and there's something funny about it. I ask people to ask their managers, because I usually work with VP level, so they have multiple teams under them. So I ask them: go ask your managers, would you rather have access...

Would you rather give everyone on the team very expensive coding agent subscriptions, or get an extra headcount? Almost every manager will say headcount. But if you ask VP level, or someone who manages a lot of teams, they would say the AI assistant tools. And the reason is that as a manager, you are still growing; you're not at a level where you manage hundreds or thousands of people. So for you, having one more headcount is big. You want it not for productivity reasons, but because you just want more people working for you. Whereas as an executive, you care more about business metrics, so you actually think about what actually drives the productivity metrics you care about. So this question of productivity is tricky. I'm not sure whether the tools fundamentally make people more productive or not; we just don't have a good way of measuring the productivity improvement.
Another thing is also very interesting. People tell me they notice different buckets of employees having different reactions to AI assistant tools. I keep going back to coding because coding is big and it's somehow easier to reason about. And I get different reports. One person told me that among his engineers, he thinks senior engineers get the most out of it, become the most productive. This person did something very interesting: he actually divided his team into three buckets, without telling them, obviously: the currently best performing, average performing, and lowest performing. And then he ran a randomized trial, giving half of each group access to Cursor. And over time he noticed something funny. The group that got the biggest performance boost, in his opinion, and he was very close to his team...
The biggest performance boost went to the senior engineers, the highest performing. So the highest-performing engineers got the biggest boost out of it, and the second-biggest went to the average performers. His theory is that the highest-performing engineers also have good working practices; they know how to break down and solve problems, so the tool helps them solve problems better. Whereas the lowest performers often don't care as much about the work, so it's easy to just go on autopilot, have the tool generate the code and ship it, or they just don't know how to use it well. Another company, however, told me that actually their senior engineers were the ones most resistant to using AI assistant tooling, because they are more opinionated and have very high standards. They were like, okay, but AI code just sucks. So they were very, very resistant to using it. I haven't quite been able to reconcile these very different reports yet.

Lenny RachitskyThis is so interesting. Just to make sure I'm hearing the story right: there's a company that did a three-bucket test with their engineering team. They created three groups, the highest-performing engineers, mid-performing engineers, and lowest-performing engineers, and gave some of them access to, say, Cursor. Was it Cursor, or what did they give them access to? It was Cursor, right?

Chip HuyenI think it was Cursor.

Lenny RachitskyOkay, cool. And so within-

Chip HuyenI didn't work with them. This is more like a friend company.

Lenny RachitskyOkay. It's a friend's company.

Chip HuyenYeah.

Lenny RachitskySo did they give half of the higher performing engineers Cursor and half not or how did they do the split there?

Chip HuyenYeah, so they give half of the entire company but half of each bucket. Yeah.

Lenny RachitskyWhoa.

Chip HuyenAnd then they observe the difference in productivity.

Lenny RachitskyI see. So how do they even do that? They're just like, "Okay, you get cursor, you don't get cursor." How did they do that? That's so interesting.

Chip HuyenYeah, I didn't get into the mechanics of it, but I was like, "I respect you for doing a randomized trial on that."

Lenny RachitskyThat is so cool.

Chip HuyenYeah. Yeah.

Lenny RachitskyOkay. Wow. How large was this engineering team? Was it like hundreds of people?

Chip HuyenIt's not that large. It's about maybe 30 to maybe 40. Yeah.

Lenny Rachitsky30 to 40. Okay.

Chip HuyenYeah.

Lenny RachitskyWow. Okay. So they found that the highest-performing engineers had the most benefit from using AI tools, then behind them the middle-tier engineers, and then the lowest performers. Okay.

Chip HuyenBut it's also not the same everywhere.

Lenny RachitskyRight. Right. Right, right.

Chip HuyenSome companies are different.

Lenny RachitskyRight. This other example you shared, of senior engineers being the most resistant to changing the way they work, which I get. I do feel like the most valuable people right now, other than ML researchers and AI researchers like yourself, are senior engineers. So much of junior engineers' work is now done by AI, but an engineer who knows what they're doing, who understands how things work at a large scale, with AI tools giving them basically infinite junior engineers doing their bidding, feels like an extremely valuable and powerful asset.

Chip HuyenYeah, I definitely see companies appreciating engineers who have a good understanding of the whole system, who have good problem-solving skills and think holistically instead of locally. One company told me the way they work is completely different now. They actually restructured the engineering org so that senior engineers spend more time on review, and they write guidelines on what good engineering practice is and what the process should look like.

So they write a lot of the processes for how to work well, and then more junior engineers just produce code and submit PRs, while senior engineers are more in the reviewing role. I think it might be preparation for the future. Another company actually told me something very similar: they're preparing for a future where they only need a very small group of very, very strong engineers to create processes and review code before it goes into production, while AI or junior engineers produce the code. But then the question becomes: how does one become a very strong senior engineer?

Lenny RachitskyRight. That's right. That's right. That's the problem. Yeah.

Chip HuyenYeah. So I don't know what's the process I was thinking about, yeah.

Lenny RachitskyNo one's thinking about it. It's a problem. We won't have any senior engineers in 10 or 20 years, because no one's hiring junior engineers. Although I could make the case that junior engineers, people just getting into computer science right now, are AI native. In theory, you could argue they will become really good really fast if they're curious, if they aren't just delegating their learning and thinking to AI but are using it to learn how to code well and architect correctly. You could argue they'll be the most successful engineers in the future.

Chip HuyenWhat you mentioned about architecture, I group that under systems thinking. I do think it's a very important skill, because AI can automate a lot of disjointed skills, but knowing how to put those skills together to solve problems is hard. There was a webinar with Mehran Sahami, who is one of my favorite professors. He chaired the curriculum committee in the CS department at Stanford, so he has spent a lot of time thinking about CS education and what students should learn nowadays in the era of AI coding. The other person was Andrew Ng, who of course is a legend in the AI space. And Professor Sahami said something very interesting: a lot of people think that CS is about coding, but it's not. Coding is just a means to an end.

CS is about systems thinking, using coding to solve actual problems, and problem solving will never go away, because as AI automates more stuff, the problems just get bigger. The process of understanding what caused an issue and designing a step-by-step solution to it will always be there. As an example, I actually have a lot of issues with the way AI does debugging. I'm not sure how much you use AI for coding, but something I have noticed, and also seen from my friends, is that it is pretty good when you have very clear, well-defined tasks: write documentation, fix a specific feature, build an app from scratch that doesn't have to interact with a large existing codebase. But if you give it something a little more complicated, maybe something that requires interaction with other components, it's usually not that good.
For example, I was using AI to deploy an application, testing out a new hosting service I was not familiar with. One thing AI does give me is the confidence to try new tools; before, trying a new tool meant reading the documentation from the beginning, but now I just try it out and learn. So I was testing out this new hosting service, and I kept getting a bug, which was very, very annoying. I asked the AI to fix it, and it kept changing things: change the environment variable, fix the code, switch from this function to that function, maybe change the language, maybe it doesn't process JavaScript, I don't know, whatever. And it didn't work. And I was like, okay, that's it.
I'm just going to read the documentation myself and see what's wrong. And it turns out I was on a different tier: the feature I wanted is simply not available in that tier. So the issue was that the AI kept trying to fix things in one component while the actual problem came from a different component entirely. Understanding how different components work together, and where the source of an issue might come from, requires a holistic view. And it made me think: how do we teach AI systems thinking? Human experts have all this scaffolding, like for this kind of problem, look into this, look into that. That could be one way. But it also made me think: how do we teach humans systems thinking? So yeah, I think it's a very interesting and very important skill.

Lenny RachitskyThat's exactly the same insight Bret Taylor shared on the podcast. He's the co-founder of Sierra. He created Google Maps. He was CEO of Salesforce, Quip, a few other things. And I asked him just like, should people learn to code? And his point is exactly what you said, which is taking computer science classes is not about learning Java and Python. It's learning how systems work and how code operates and how software works broadly, not just, here's a function to do a thing.

One thing that I wanted to help people understand, you wrote this book called AI Engineering, which is essentially helping people understand this new genre of engineer and you have this really simple way of thinking about the difference between an ML engineer and an AI engineer, which has a really good corollary to product managers now, of just an AI product manager versus a non-AI product manager. The way you describe it and fill in what I'm missing is just ML engineers built models themselves. AI engineers use existing models to build products. Anything you want to add there?

Chip HuyenOne thing I really dislike about writing books is that you have to define things, and no definition will be perfect because there are always edge cases. But in general, I think GenAI is more like a service now: somebody builds the models for you, and the base model performance is pretty good. That has enabled people to say, okay, now I want to integrate AI into my product, and I don't need to learn how to build models, even though knowing that could really help. It makes the entry barrier really low for people who want to use AI to build products, and at the same time, AI capabilities are so strong that it also expands the possibilities, the types of applications AI can be used for. So both the entry barrier is super low and the demand for AI applications is a lot bigger. It's very, very exciting; it opens up a whole new world of possibilities.

Lenny RachitskyYeah. Now you don't have to spend time building this AI brain; you can just use it to do stuff. Such an unlock. Okay. Maybe just a final question. You get to see a lot of what's working, what's not working, and where things are heading. I'm curious: thinking about the next two or three years, how do you think building products will be different? How do you think the way companies work will be different? What's the biggest change you expect to see in the next few years in how companies work?

Chip HuyenA lot of organizations don't move that fast, but at the same time, they move faster than I expected. Though again, I'm biased: I don't work with dinosaur companies that don't care, and a lot of the executives who come to me are very forward-looking. So maybe I'm very biased toward organizations that move fast. One big change I see is in organizational structure. Before, we had a lot of disjointed teams: a very clear engineering team, a product team. But then there's the question of who should write evals, who should own the metrics. And it turns out eval is not a separate problem; it's a systems problem, because you need to look into different components and how they interact with each other, and you need user behavior, because you need to know what users care about so you can write evals that reflect what users care about.

Some of that you sort out by looking into the different component architectures and placing guardrails, which is engineering; but understanding users is product. Because of all that, and because eval is extremely important, it brings the product team and engineering team, and even marketing and user acquisition, very close to each other. So in some ways people are restructuring so there's more communication between previously very distinct functions. Another thing I see is teams thinking about what can be automated in the next few years and what work cannot be automated. People are already shedding roles; it's a little scary to think about, but teams tell me, okay, just between you and me, we have gotten rid of these functions, a lot of the things that were previously outsourced, for example.
Traditionally that's business process outsourcing: work that's not core to the company and can be systematized, and with that, you can actually use AI to automate a lot of it. Separately, people are thinking more about what the value of junior engineers versus senior engineers is, and how to restructure the engineering org accordingly. So I definitely think one mark of successful organizations is that people are moving pieces around, thinking about use cases, whether to spin up new use cases, and who would lead a new effort. That's one big change. Another thing, in terms of AI itself, and I'm not sure how true this is, but I'm in the camp that thinks it has merit, is the view that base models are probably not quite maxed out, but we're unlikely to see really, really crazily strong jumps.
Remember when we had GPT, and then GPT-2, which was a big step up over GPT, and then GPT-3, much, much bigger than GPT-2, and GPT-4, much, much bigger again. Then of course GPT-5, but whether GPT-5 was that same scale of step up compared to its predecessor, I think, is debatable. So I think the base model performance improvements are not going to be as mind-blowing as they were in the last three years. Instead, I see a lot of improvement happening in the post-training phase and in the application-building phase, and that's where I expect to see a lot of the gains. I'm also very interested in multimodality. We've seen a lot of text-based work, but I think there are a lot of audio and video use cases that are very, very exciting.
And I think audio is not quite solved. I work with a couple of voice startups, and when you think about voice, it's an entirely different beast. Say you have a chatbot and you go from a text chatbot to a voice chatbot. The constraints are completely different, because now you need to think about latency. There are multiple steps: first voice to text, then text to text, turning the text question into a text answer, and then text to voice for the answer. So you have multiple hops, and latency becomes very important. And there's the question of what makes it sound natural. For example, when humans talk to each other, if you try to interrupt me and say, "Chip...", I would pause and try to hear you out.
But if you just say some acknowledgment words while I'm talking, like "mm-hmm, mm-hmm", I shouldn't stop; I should just continue. So detecting interruptions, and deciding whether to stop or not, is a big part of what's perceived as natural conversation. And there's also regulation, because a lot of the time people want to build voice chatbots that sound like humans, maybe even trick users into thinking they're talking to humans, but there may be regulation saying you have to disclose to users whether the bot they're talking to is human or AI. So there's a whole space here that I think is not as solved as you might think. But it's not really a foundation model problem, because human interruption detection is actually a classical machine learning problem.
It's a different framing, but you can build a classifier for that. And the question of latency is actually a massive engineering challenge, not an AI challenge. Of course, it can become an AI challenge, because people are trying to build voice-to-voice models: instead of first transcribing my voice into text, then getting a text answer from a model, then having another model turn the text into speech, you do voice-to-voice directly. People are working on that, but it's very hard. And even so, audio is, I think, easier than video, because video has both images and sound. Audio alone is already pretty hard, so I think there are a lot of challenges in that space.
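The multi-hop voice pipeline and its latency cost can be sketched as below. All three stages are stubs with simulated delays (the model calls, canned strings, and the 50 ms figures are placeholders); the point is that end-to-end latency is the sum of the hops, which is what motivates direct voice-to-voice models.

```python
import time

# Sketch of the pipeline described above: speech-to-text -> text-to-text ->
# text-to-speech, with per-stage latency measured at each hop.

def speech_to_text(audio: bytes) -> str:
    time.sleep(0.05)  # stand-in for an ASR model call
    return "what are your opening hours"

def text_to_text(question: str) -> str:
    time.sleep(0.05)  # stand-in for the LLM call
    return "We are open 9am to 5pm."

def text_to_speech(answer: str) -> bytes:
    time.sleep(0.05)  # stand-in for a TTS model call
    return answer.encode()

def answer_voice_query(audio: bytes) -> tuple[bytes, dict[str, float]]:
    """Run the three hops and record how long each one took."""
    timings: dict[str, float] = {}
    t0 = time.perf_counter()
    text = speech_to_text(audio)
    timings["stt"] = time.perf_counter() - t0
    t0 = time.perf_counter()
    reply = text_to_text(text)
    timings["llm"] = time.perf_counter() - t0
    t0 = time.perf_counter()
    audio_out = text_to_speech(reply)
    timings["tts"] = time.perf_counter() - t0
    timings["total"] = sum(timings.values())  # latency stacks across hops
    return audio_out, timings

audio_out, timings = answer_voice_query(b"...")
print({stage: round(seconds, 3) for stage, seconds in timings.items()})
```

A voice-to-voice model collapses the three hops into one, which is why it is attractive despite being much harder to build.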

Lenny RachitskyThat was an awesome list of things. Let me mirror them back real quick. What you're predicting in the next few years, the things that will change in the way we work, and these resonate with so many conversations I've had on this podcast, kind of doubling down on where things are heading. One is the blurring of lines between different functions; instead of just design or engineering, everyone's going to be doing a lot of different things now. Two is more work being automated with agents and all these AI tools, and in theory, productivity going up. Third is a shift from pre-training models to post-training, fine-tuning, and things like that, because to your point, models are maybe slowing down in how smart they're getting.

Although I'll point folks to my chat with the co-founder of Anthropic. He made a really good point here: we're really bad at understanding what exponentials feel like when we're in the middle of one. And also, models are being released more often, so we may not notice the difference between them because releases just happen more frequently, versus GPT-3 coming out a year after GPT-2. Maybe true, maybe not. And then the fourth point you made is this idea of investing in multimodal experiences. I cannot wait for ChatGPT voice mode to get better at interruption, exactly what you're saying. I'm just talking to it, and then someone makes a little sound and it stops talking. It's so annoying.

Chip HuyenI'm shocked that we don't have better voice assistant at home yet. I think I have been testing out a bunch, honestly. I keep hoping, oh my God, that could be the one and then I know how many of them I just had to give away because they're not that good.

Lenny RachitskyI think it's coming. I hear it's coming. Anthropic's working with someone that, I don't know if it's launched or not yet.

Chip HuyenYeah, to bring it back to what your guest from Anthropic mentioned about performance improvement: I think there's a big difference between a model's base capability, meaning the pre-trained model, and the perceived performance. Are you familiar with the term test-time compute?

Lenny RachitskyI don't think so. Help us understand.

Chip HuyenThe idea is this: you have a fixed amount of compute. You spend a lot of compute on pre-training the model, then you spend some compute on fine-tuning, and the ratio of pre-training to post-training compute varies a lot between different labs. And then you also spend compute on inference: once you have a trained and fine-tuned model and you want to serve it to users, they type a question in a prompt, the model generates an answer, and that inference requires compute. So there's a discussion of whether you should spend more compute on pre-training, on fine-tuning, or on inference. Spending more compute on inference is what's called test-time compute: a strategy of allocating more compute resources at inference time, which can bring better performance. And how does it do that?

Say you have a math question. Instead of generating just one answer, you generate four different answers and pick whichever is best according to some standard. Or, say three of the four answers say 42 and one says 20; three of them are in agreement, so the answer should be 42. So people just generate a bunch of answers. Another approach is reasoning, or thinking: the model generates more thinking tokens, spending more time thinking before showing the final answer. That requires more compute but also gives better performance. So from the user's perspective, when the model spends more time exploring different potential answers and thinking longer, it can give you a much better final answer, but the base model itself does not change.
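The majority-vote flavor of test-time compute described here can be sketched in a few lines. `sample_answer` is a deterministic stand-in for sampling the model with temperature above zero; a real system would call the model several times and vote over the results.

```python
from collections import Counter

# Sketch of self-consistency-style test-time compute: sample several answers
# to the same question and return the majority vote.

def sample_answer(question: str, i: int) -> str:
    # Stand-in for sampling the model with temperature > 0: here the "model"
    # mostly answers 42 but one sample slips to 20, as in the example above.
    return ["42", "42", "20", "42"][i % 4]

def self_consistency(question: str, n_samples: int = 4) -> str:
    """Spend n_samples inference calls and take the most common answer."""
    answers = [sample_answer(question, i) for i in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 x 7?"))  # majority vote -> "42"
```

The base model is unchanged; only the inference budget grows, which is the whole point of the technique.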

Lenny RachitskyAwesome.

Chip HuyenDoes it make sense?

Lenny RachitskyYes, that does. Absolutely.

Chip HuyenYeah?

Lenny RachitskyThat is a good corollary to Ben Mann's point.

Chip HuyenYeah.

Lenny RachitskyChip, we covered a lot of ground. I've gone through everything I was hoping to learn and more. Before we get to a very exciting lightning round, is there anything else that you wanted to share? Anything else you want to leave listeners with?

Chip HuyenSo I do work with a few companies that want employees to come up with ideas. There's a big debate about the better approach to AI strategy: should it be top-down or bottom-up? Should executives come up with one or two killer use cases and everyone allocate resources to those, or should you let engineers and PMs and smart people come up with ideas? I think it's a mixture of both. Some companies say, okay, we hired a bunch of smart people, let's see what they come up with, and they organize hackathons or internal challenges to get people to build products. And one thing I noticed is that a lot of people just don't know what to build. It shocked me. I feel like we are in some kind of idea crisis, right?

Now, we have all these really cool tools to do everything from scratch: they can help you design, they can help you write code, they can build websites. So in theory, we should see a lot more, but at the same time, people are somehow stuck. They don't know what to build. And I think a lot of it has to do with societal expectations, because we have gone into this phase of specialization. People are very highly specialized and are supposed to focus on doing one thing really well instead of seeing the big picture. And without a big-picture view, it's hard to come up with ideas for what to build.
So when I work with these companies on hackathons, we do come up with guidelines for how to come up with ideas. One tip is to look at your last week. For a week, just pay attention to what you do and what frustrates you. When something frustrates you, think about it: is there anything we can do? Can it be done a different way so it's not frustrating? People can swap frustrations with teammates, and maybe they find common frustrations; then there's something you can build around that. So just notice how we work, constantly ask, "How can this be better?", and then build something to address the frustrations. I think it's a good way to learn and adopt AI.

Chip HuyenOh, I would love to see that. I'm very bullish on using AI to create micro tools, just something that makes your life a bit easier.

Lenny RachitskyA hundred percent. I feel like that's one of the main ways people are using these tools, just a little niche problem they have. With that, Chip, we've reached our very exciting lightning round. I've got five questions for you. Are you ready?

Chip HuyenYeah, always. No, no, no. It depends on how hard the questions are.

Lenny RachitskyThey're very consistent across every guest. So I imagine you've heard them before. First question, what are two or three books that you find yourself recommending most to other people?

Chip HuyenI'm really terrified of book recommendations, because I feel like what books you should read really depends on what you want, where you are in life, and where you want to get to. But there are a few books that I do think have really changed the way I think and see the world. One is The Selfish Gene. It actually helped me with the question of whether I want to have kids or not, because it helped me understand that a lot of our functions, the way we operate, are functions of our genes, and genes want to do one thing: procreate.

So in a way, the book also proposed another idea: everyone wants to live forever. Maybe not consciously, but subconsciously, we do want that. And there are two ways. One is via genes; genes want to continue forever. The other is via ideas. If you put some ideas out there and they last for a long time, they live on. I know it's a little bit abstract, but I thought it was very interesting.
The other book I really, really like is by the former Singaporean prime minister, Lee Kuan Yew, who I think is known as the Father of Singapore. I'm not sure of the exact title, but he was the one who led Singapore's transformation from a third-world country to a first-world country within 25 years. And I have never seen another country's leader put so much effort into writing down his thinking on how to build a country like that.
He talks a lot about public policy, how to create policies that encourage people to do the right things for the nation, and also about foreign affairs, foreign policy, and the liberation of the country, among other things. So it's a really good book to think about. For me, it's systems thinking, but for a different kind of system, a country, which most of us never get a chance to experiment with in our lives. So it's good to learn about that.

Lenny RachitskyWhat was the name of that second book?

Chip HuyenIt's called From Third World to First. Actually, I think I have it somewhere here. Yeah.

Lenny RachitskyThere it is.

Chip HuyenIt's a very heavy book.

Lenny RachitskyShow and tell.

Chip HuyenYeah.

Lenny RachitskyThat's awesome. I definitely want to read that. That's a really good one. I've heard a lot about the impact he's had, and I've seen all these videos on Twitter of his really wise insights into how to build a thriving society. And clearly, it worked.

Chip HuyenYeah. Can you believe it? How did he find the time to write such a thick book? It's insane.

Lenny RachitskyThat is. Claude, please summarize. I'm just joking. By the way, Selfish Gene, I also absolutely love that book. That is such a good choice. It's such an under the radar kind of book that really changed the way I see the world as well. So really good pick. Okay, next question. Do you have a favorite recent movie or TV show you really enjoyed?

Chip HuyenSo I watch a lot of movies and TV shows as research, because I'm working on my first novel, and I recently sold it. It's a drama, not science fiction or anything tech people usually read. I know it's very out of left field. So it's almost like watching TV to see what kinds of stories become popular, trying to understand the tropes and things like that. So I'm not sure if the audience will like...

Lenny RachitskyWell, what's one? What's one that taught you something about writing?

Chip HuyenI think like Yanxi Palace. It's a Chinese TV show.

Lenny RachitskyCool. Okay. I haven't heard that one on the podcast before. Okay, cool.

Chip HuyenYeah.

Lenny RachitskyNext question. Do you have a life motto that you often think about, come back to when you're dealing with something hard, whether it's in work or in life?

Chip HuyenThis sounds very nihilistic, but I'd say that in the end, nothing really matters. Usually I think of it like this: in the grand scheme of things, in a billion years, no one will be there. Okay, someone will argue with me about that. But my theory is that in a billion years, none of us will exist. So whatever messy things, crazy things we do, or however badly we do them, no one will be there to remember it. In a way, it sounds scary, but it's very liberating, because it allows me to say, okay, let's just try things out, right? Why does it matter? And there's a recent story. I have a family member who passed away recently, and I was talking to my dad because I couldn't be home for it.

I was asking my dad, "Okay, is there anything I can do to make him comfortable? Anything I can get him?" And my dad just said, "What could he possibly want at this moment?" It made me feel that at the end of life, there's nothing material that can bring you joy. No money, no product, nothing. And in a way, it makes me ask, okay, what do I really care about at the end of the day? So when I think about it: okay, maybe I fail, maybe I don't get that contract. But at the end of life, I don't think that actually matters. So in a way, it's quite liberating.

Lenny RachitskyI know you said it might be nihilistic, but this is what Steve Jobs shared too in one of his most famous speeches: we all die someday, so don't take things so seriously. And it is freeing, absolutely. It just makes you appreciate every moment, every day you have. Like, yeah, let's just do something hard and scary. Okay, final question. You talked about how you're writing a novel. Most people in tech have never written something creative, fiction. What's one thing you learned in the process about how to write better stories, better fiction?

Chip HuyenA lot of the time when we read, we get tripped up by small things. I wanted to do creative writing because I just want to become a better writer, and I thought that trying a different audience could make me better at anticipating what different types of audiences want to hear and what they care about. I think writing, or any kind of content creation, is about predicting the audience's reactions, right?

Lenny RachitskyThe next token.

Chip HuyenYou do a podcast.

Lenny RachitskyJust kidding.

Chip HuyenYeah. Yeah, so when you do a podcast, it's like, okay, what kinds of things would listeners find engaging, right? And a lot of companies, when you launch a product, you have a narrative: okay, how do we position this product in a way that users would want? So I've done technical writing for a while, and I felt like I had some experience trying to predict what engineers would want to hear or care about. But I don't have any experience with this completely different type of audience. That's why I wanted to do creative writing, writing a story. And that's why I was doing a lot of research, which I enjoyed a lot, watching a lot of dramas, just seeing what people like. So one thing I care about, something I learned from an editor, is the emotional journey.

So when we write something, we care about how readers feel across the story. We need a hook in the beginning so that people continue reading. But we also don't want too much drama, because readers get tired, emotionally exhausted, like they're being emotionally manipulated. So you design an emotional journey: maybe a climax, then something more chill. Another thing I didn't realize: in technical writing, you focus entirely on the content, the argument. It's very impersonal. For example, for people learning about ML compilers, it doesn't matter whether they like the person telling them about compilers, because it's just objective. But for a novel, people care about character likability.
So in the first version of my story, I made the main character very logical, very rational, doing everything very rationally. And the feedback I got... I have a very good friend who read it, an amazing person, a great person, and he said, "Chip, I'll be honest, I hate that person." It doesn't matter how good the story is; if the character is that unlikable, he doesn't want to continue. So in the second version, I made the character more likable. The way you make a character more likable is to put in some vulnerability: okay, maybe this person has setbacks, because we can relate to that. So in a lot of ways, it's very interesting. A lot of it is about understanding the emotional side, how readers feel, not just about the story but also about the characters.

Lenny RachitskyThat is so interesting. Wow. I learned a lot more there than I expected. That was awesome, really good example. Chip, two final questions. Where can folks find you online if they want to reach out, maybe work with you, or check out the stuff you offer? And how can listeners be useful to you?

Chip HuyenI'm on social media, LinkedIn, Twitter. I don't post a lot, but I keep telling myself I should do more, because I like the conversations with readers. I'm actually about to start a Substack. I have a placeholder for the Substack right now, and I'm thinking of focusing it on systems thinking, because I think it's a very interesting skill. I'm also thinking of doing a YouTube channel on book reviews, basically books that help you think better. I think the first book I'll review is probably this one, because it's my favorite book growing up and I keep rereading it. So how can you be helpful? Send me books that you like, books that have changed the way you think or the way you do anything. I would appreciate it.

Lenny RachitskyAmazing. I'm excited to read that book.

Chip HuyenMm-hmm.

Lenny RachitskyChip, thank you so much for being here.

Chip HuyenThank you so much, Lenny, for having me.

Lenny RachitskyBye everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.

English Original transcript

Chip HuyenA question that gets asked a lot is, "How do we keep up to date with the latest AI news?" Why do you need to keep up to date with the latest AI news? If you talk to the users, understand what they want or don't want, and look into the feedback, then you can actually improve the application way, way, way more.

Lenny RachitskyA lot of companies are building AI products. A lot of companies are not having a good time building AI products.

Chip HuyenWe are in an idea crisis. Now, we have all these really cool tools to do everything from scratch: they can help you design, they can help you write code, you can build a new website. So in theory, we should see a lot more, but at the same time, people are somehow stuck. They don't know what to build.

Lenny RachitskyFor all this AI hype, the data actually shows that most companies try it, it doesn't do a lot, and they stop. What do you think is the gap here?

Chip HuyenIt's really hard to measure productivity. So I do ask people to ask their managers, "Would you rather give everyone on the team very expensive coding-agent subscriptions, or get an extra head count?" Almost every time, the managers say head count. But if you ask someone at the VP level, someone who manages a lot of teams, they'd say they want the AI assistant. Because as a manager, you are still growing, so for you, one more head count is big. Whereas as an executive, maybe you have broader business metrics that you care about, so you actually think about what actually drives productivity for you.

Lenny RachitskyToday, my guest is Chip Huyen. Unlike a lot of people who share insights into building great AI products and where things are heading, Chip has built multiple successful AI products, platforms, tools. Chip was a core developer on NVIDIA's NeMo platform, an AI researcher at Netflix. She taught machine learning at Stanford. She's also a two-time founder and the author of two of the most popular books in the world of AI, including her most recent book called AI Engineering, which has been the most read book on the O'Reilly platform since its launch.

She's also gotten to work with a lot of enterprises on their AI strategies, and so she gets to see what's actually happening on the ground inside a lot of different companies. In our conversation, Chip explains a lot of the basics like, what exactly does pre-training and post-training look like? What is RAG? What is reinforcement learning? What is RLHF? We also get into everything she's learned about how to build great AI products, including what people think it takes and what it actually takes. We talk about the most common pitfalls that companies run into, where she's seeing the most productivity gains and so much more.

Chip HuyenHi, Lenny. I've been a big fan of the podcast for a while, so I'm really excited to be here. Thank you for having me.

Lenny RachitskyI want to start with this table/chart that you shared on LinkedIn a while ago that went super viral, and I think it went super viral because it hit a nerve with a lot of people. Let me just read this and we'll show this on YouTube for people that are watching. So it's this very simple table you shared of what people think will improve AI apps and what actually improves AI apps. What people think will improve AI apps, staying up to date with the latest AI news, adopting the newest agentic framework, agonizing about what vector databases to use, constantly evaluating what model is smarter, fine-tuning a model. And then you have what actually improves AI apps, talking to users, building more reliable platforms, preparing better data, optimizing end-to-end workflows, writing better prompts. Why do you think this hit such a nerve with people? If you had to boil it down, what do you think people are missing about building successful AI apps?

Chip HuyenA question that gets asked a lot is, "How do we keep up to date with the latest AI news?" I'm like, "Why? Why do you need to keep up to date with the latest AI news?" I know it sounds very counterintuitive, but there's just so much news out there. A lot of people also ask me questions like, "How do I choose between two different technologies?" Recently it's been MCP versus the agent-to-agent protocol: "Which one is better?" I think the question you should ask is, "First, how much improvement would you get from the optimal solution versus a non-optimal solution?" Right? And sometimes they say, "Actually, not much."

I was like, "Okay, if it's not much improvement, then why do you want to spend so much time debating something that doesn't make that much difference to your performance?" Another question to ask is, "If you adopted a new technology, how hard would it be to switch it out for another?" And sometimes they say, "Oh, I think it could be a lot of work to switch it out." And I'm like, "Hmm, so here's a new technology. It hasn't been tested by a lot of people, and if you adopt it, you'll be stuck with it forever. Do you actually want to adopt it?" Maybe you want to think twice about overcommitting to new technologies that haven't been well tested.

Lenny RachitskyI love that your broader advice is so simple: to build successful AI apps, talk to users, prepare better data, write better prompts, optimize the user experience, versus chasing what's the latest and greatest, what's the best model to use right now, what's happening in AI. Let me follow this thread on fine-tuning and, basically, post-training. There are all these terms people hear in AI, and I think this is a really good opportunity for people to learn what we're actually talking about, since you actually do these things, build these things, and work with companies doing these things. There are a few terms I want to sprinkle in through the conversation, but let's start with this one. What's the simplest way for someone to understand the difference between pre-training and post-training, and how fine-tuning fits into that? What is fine-tuning, actually?

Chip HuyenDisclaimer: I don't have full visibility into what these big, secretive frontier labs are doing. But from what I've heard, one piece is supervised fine-tuning, where you have demonstration data. You have a bunch of experts: "Okay, here's a prompt, and here's what the answer should look like." You train the model to emulate what the human expert would write. That's also what a lot of open-source models do via distillation: instead of having human experts write really great answers to prompts, they have very good, well-known models generate the responses, and they train smaller models to emulate them.

I really appreciate the open-source community, by the way, but being able to train a model that emulates an existing good model is very different from being able to train a good model without an existing good model to copy from. There's a big step there. So we have supervised fine-tuning, and another thing that's very big, I'm not sure if you've had guests talk about it already, is reinforcement learning. It's everywhere.

Lenny RachitskyLet's pause on that, because I definitely want to spend time on reinforcement learning; it's such a cool topic that's coming up more and more in my conversations. But just to summarize the things you shared, which I think are really, really important: the idea here is that a model is essentially an algorithm, a piece of code someone writes, and for the frontier models they're feeding it essentially the entire internet of content, and it's testing itself on predicting, across all that data, the next word. Token is the correct way of thinking about it, but a simpler way to think about it is the next word in the text. As it gets things wrong, it adjusts these things called weights. Is that a simple way to think about it, even at a very surface level?

Chip HuyenSo, I think of language modeling as a way of encoding statistical information about language, right? Let's say we both speak English, so we have a sense of what is more statistically likely. If I say "my favorite color is," then you'd think, okay, the next word should be a color. The word "blue" would be much more likely to appear than some unrelated word, right? Because statistically, "blue" is more likely to follow "my favorite color is." So it's a way of encoding statistical information.

So when you train a language model on a large amount of data, it sees a lot of language across a lot of domains, so given the user's prompt, it can come up with the next most likely token. By the way, it's not a new idea. The idea is very, very old; it comes from a 1951 paper on the entropy of printed English, I think by Claude Shannon. It's a great paper. And there's a story I really like that relates to it... Did you read Sherlock Holmes, by the way?

Lenny RachitskyYeah, I read a few Sherlock Holmes books. Yeah.

Chip HuyenYeah. So there's a story where Sherlock Holmes uses this statistical information to help solve a case. Somebody left a message written in stick figures. Sherlock Holmes knows that in English, the most common letter is E, so the most common stick figure must be E, and he goes on like that to crack the code. So in a way, it's simple language modeling, but instead of at the word level, he does it at the character level. A token is something in between: a token is not quite a word, but it's bigger than a character. We use tokens because they help us reduce the vocabulary. Characters give the smallest vocabulary; the alphabet has only 26 of them, but there can be millions and millions of words. Tokens let you hit a sweet spot between the two.
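Holmes's trick is character-level frequency analysis: match the most common cipher symbol to the most common English letter. A minimal sketch of the idea (the ciphertext and the exact frequency order here are illustrative, not taken from the story):

```python
from collections import Counter

# A common frequency ordering of English letters, most frequent first.
ENGLISH_FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def frequency_map(ciphertext: str) -> dict:
    """Map each cipher symbol to an English letter by matching
    frequency ranks, Holmes-style. Purely illustrative; real
    cryptanalysis needs much more context than rank matching."""
    counts = Counter(c for c in ciphertext if c.isalpha())
    ranked = [sym for sym, _ in counts.most_common()]
    return {sym: ENGLISH_FREQ_ORDER[i] for i, sym in enumerate(ranked)}

# 'x' plays the role of the most common stick figure.
mapping = frequency_map("xyzxxqx")
print(mapping["x"])  # -> 'e': most frequent symbol maps to most frequent letter
```

The same rank-matching idea, scaled from 26 characters up to tens of thousands of tokens and from single-symbol counts up to full-context conditional probabilities, is what a language model's statistics generalize.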

So let's say we have a new word, say "podcasting." Even if it's a new word, it can be divided into "podcast" and "ing." We know the meaning of "podcast," and we know "ing" marks a gerund, so we can understand the word "podcasting." That's where tokens come in. So pre-training is basically encoding the statistical information of language to help you predict what is most likely. And "most likely" is the simplest way of putting it, because it's really building a distribution: okay, 90% of the time the next token could be a color, 10% of the time it could be something else. It's a distribution, so the language model can pick from it depending on your sampling strategy. Do you want it to always pick the most likely token, or do you want it to pick something more creative? I think sampling strategy is extremely important. It can boost performance in a huge way, and it's very, very underrated.
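The sampling-strategy point can be made concrete. Given a next-token distribution, greedy decoding always takes the most likely token, while temperature sampling reshapes the distribution before drawing, trading determinism for creativity. A toy sketch with made-up logits (not real model outputs):

```python
import math
import random

# Made-up next-token scores after "my favorite color is".
logits = {"blue": 4.0, "red": 3.0, "banana": -2.0}

def softmax(logits: dict, temperature: float = 1.0) -> dict:
    """Turn logits into probabilities. Low temperature sharpens the
    distribution toward the argmax; high temperature flattens it."""
    exps = {t: math.exp(v / temperature) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def greedy(logits: dict) -> str:
    """Always pick the single most likely token."""
    return max(logits, key=logits.get)

def sample(logits: dict, temperature: float = 1.0) -> str:
    """Draw one token at random according to the tempered distribution."""
    probs = softmax(logits, temperature)
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(greedy(logits))       # always "blue"
print(sample(logits, 0.1))  # almost always "blue"
print(sample(logits, 2.0))  # noticeably more "red", occasionally "banana"
```

The model's weights are identical in all three calls; only the sampling strategy changes, which is why tuning it can change perceived quality so much.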

Lenny RachitskyOkay, awesome. So essentially, a model is just code with this whole set of weights: a statistical model that has learned to predict what comes next after certain words and phrases?

Chip HuyenYeah.

Lenny RachitskyAnd then post-training, and fine-tuning specifically, is doing that same thing. So pre-training gets you GPT-5. Fine-tuning is someone taking GPT-5 and doing the same sort of thing, adjusting these weights a little bit for a specific use case, on data they find is necessary for that very specific use case. Is that a simple way to think about it?

Chip HuyenYeah, weights are part of a function, right? Let's say you have a function: maybe Lenny's height is 1 times X plus something, or 2 times X plus something. The 1 or the 2 is a weight. You adjust it until the function fits the correct data, which is my height and your height. So training is adjusting the weights so the function fits the data, the training data.
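Chip's "weights as a function" analogy can be sketched as fitting a line y = w*x + b to data by gradient descent; training a neural network is the same loop with vastly more weights. The data here is invented for illustration:

```python
# Fit y = w*x + b to toy data by gradient descent on squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b = 0.0, 0.0            # initial weights
lr = 0.02                  # learning rate
for _ in range(5000):
    # Gradients of mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w       # nudge the weights to better fit the data
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

"Adjusting the weights until the function fits the data" is exactly what the update lines do; pre-training and fine-tuning differ mainly in what data the loop sees and where the weights start.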

Lenny RachitskyAwesome. Okay. So we're talking about pre-training, post-training, fine-tuning. Is there anything else here that's important to share about just what this is exactly? What people need to understand about these parts of training?

Chip HuyenSo the vast majority of the time, we don't touch the pre-trained model. As users, we don't use it at all.

Lenny RachitskyRight. It's already done for us.

Chip HuyenYeah. A fun part of the process, when my friends train models, is that they play with the pre-trained model, and its outputs are horrendous; they're like, "Oh, my gosh." It's crazy. So it's very interesting to see how much post-training can change the model's behavior, and that's where a lot of frontier labs are spending their energy nowadays: post-training. Pre-training has been used to increase the general capabilities of a model, and it needs a lot of data and model size to do that. At some point, we have kind of maxed out on internet text data. A lot of people are working with other data like audio and video, and everyone's trying to think of what the new source of data is. But because everyone has very similar pre-training data, post-training is where they make the big difference nowadays.

Lenny RachitskyThis is a good segue. You talked about supervised learning versus unsupervised learning. I love that we're getting into this, by the way; this is super interesting. So you talk about labeled data. Basically, supervised learning is AI learning on data that somebody has already labeled, telling it, here's correct versus incorrect. For example, this is spam versus not spam; this is a good short story, this is not a good short story. We've had the CEOs of a lot of the companies that do this for labs: Mercor and Scale, Handshake, there's Micro, there's a few others. So is that essentially what these companies are doing for labs, giving them labeled data, high-quality data to train on?

Chip HuyenIt is, in a way, but I think it's one part of a bigger equation. There are a lot more components than that. That's why I was talking about reinforcement learning; I'm not sure if your CEO interviews brought up that term. The idea is: let's say you have a model, you give it a prompt, and it produces an output. You want to reinforce, to encourage, the model to produce outputs that are better. So now it comes down to: how do we know whether an answer is good or bad? Usually, people rely on signals. One way to tell good from bad is human feedback. You show a human two responses and they say, okay, this one's better than the other. We do it that way because, as humans, it's very hard to give a concrete score, but it's easier to make comparisons.

If you ask me, okay, give this song a score; I'm not a musician and I don't know how hard it is, so I don't know, out of 10, maybe a six. And if you ask me again a month from now and I've completely forgotten, okay, maybe now it's a seven, or a four, I don't know. But if you ask me, okay, here are two songs, which one would you prefer to play at a birthday party? I'd say, "Okay, I prefer this one." Comparisons are a lot easier. So you gather this human feedback and use it to train a reward model, and then the reward model scores the responses the model produces.
It can score: is this good or bad? And you bias the model toward producing the better responses. Alternatively, instead of using a human, you can use AI to judge the response and say good or bad, right? Or, and this is what people are very big on nowadays, verifiable rewards. Basically, you give the model a math problem and it outputs a solution. The expected answer should be 42, and if it doesn't produce 42, then it's wrong; it's not a good response. So yes, a lot of the time, people use human labelers to produce expert questions and expected answers, in a way the system can verify, so the models can be trained on them. Yeah.
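The "verifiable rewards" idea fits in a few lines: for tasks with a checkable answer, the reward signal is just a programmatic comparison, with no human rater or learned reward model in the loop. `model_answer` here is a stand-in for a model's output; real pipelines also parse the final answer out of a longer response:

```python
def verifiable_reward(model_answer: str, expected: str = "42") -> float:
    """Return 1.0 if the model's final answer matches the expected
    solution, else 0.0. Minimal sketch of a verifiable-reward check;
    production versions normalize formats and extract the answer
    from a full chain-of-thought response."""
    return 1.0 if model_answer.strip() == expected else 0.0

# During RL training, responses scoring 1.0 get reinforced.
print(verifiable_reward("42"))    # -> 1.0
print(verifiable_reward("20"))    # -> 0.0
```

This is why math and code are popular RL domains: the check is cheap and objective, unlike preference comparisons, which need a human or a reward model to supply the signal.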

Lenny RachitskyOkay, I'm really glad you went there. This is essentially RLHF, reinforcement learning from human feedback, which is exactly what I wanted to talk about, right?

Chip HuyenYeah. In general, it's a way of learning; training is learning. Whether it learns from human feedback, AI feedback, or verifiable rewards, I'd say those are just different ways of collecting signals.

Lenny RachitskyAwesome. Yeah. We had the CEO of Anthropic on the podcast, and he talked about their version of RLHF, which is AI-driven reinforcement learning. I love the way you phrased it: basically, you want to help the model, you want to reinforce correct behavior and correct answers, and this is the method to do it. Whether it's, say, an engineer seeing an output from a model and saying, "No, here's how I would code it differently," and that trains a different model that the original model works with, to tell it, am I correct or not correct? Is that right, roughly?

Chip HuyenYeah.

Lenny RachitskyOkay.

Chip HuyenI think that's one way of looking at it. This space is so exciting nowadays because there are so many domain-expert tasks that model developers want models to do well on, right? Let's say you're an accountant. Maybe you want a model to handle accounting tasks, so you need a lot of accounting data, examples from accountants, so you need to hire a lot of them. The same goes for physics problems, legal questions, engineering questions. Somebody was telling me they want to use coding to solve scientific problems, not just coding to build products, which is a whole different realm of things. And there's also very specific tooling: I'm not sure what apps you use, but maybe QuickBooks or Excel or Google Sheets. They involve very specific tools and specific expertise, and you want the model to learn them.

So they need a lot of human experts in these areas to create data to train them, and it's a massive thing because everyone wants a lot of data and wants an unlimited budget. I think this is also a little bit of low-key interesting economics. I'm not sure if you've talked with your guests about it, but I find it very interesting to think about because it's very lopsided, right? There are only a very small number of frontier labs, and they want a lot of data, and there's a massive number of startups and companies providing that data. So you see these startups doing data labeling. They may have massive ARR, but if you ask them, "Okay, so how many customers do you have?" it could be a very small number. I'm not sure... I saw you smiling.

Lenny RachitskyYeah, yeah, yeah, we chatted about that.

Chip HuyenYeah, so I'd be a little bit uneasy if I had a company growing like crazy but heavily dependent on two or three customers. And at the same time, if I were one of those frontier labs, what would be the economically right thing for me to do? I'd want a lot of startups, a lot of providers, so I can pick and choose, and so these providers compete with each other to lower the price, and I'm not so dependent on any one of them. So yeah, this whole economics is very interesting to me, and I'm curious to see how it plays out.

Lenny RachitskyWhat I'm hearing is you're bearish on the future of these data labeling companies, because, as you said, they don't have a lot of leverage over pricing: they have so few customers, and there are so many people getting into the space. So basically, even though these are some of the fastest-growing companies in the world, you're feeling like there's a challenge up ahead.

Chip HuyenI'm not sure if I'm bearish on it. I think I'm curious, because things have a way of working out in ways that I don't expect. Maybe these companies, since they have a lot of data, will be able to use it to build some insight that helps them stay ahead of the curve. So I don't know.

Lenny RachitskyA very fair answer. Okay, while we're on this topic, I want to chat about evals, which is a very recurring topic in this podcast. This is the other piece of data content these companies share that AI labs really need. Can you just talk about what an eval is, the simplest way to understand it and then how this helps models get smarter?

Chip HuyenSo when people approach evals, I think there are two very different problems. One is as an app builder: say I have an app, maybe a chatbot, the very simplest first thing that comes to mind, and I want to know if the chatbot is good or bad. So I need to come up with a way to evaluate the chatbot. The other I think of as task-specific eval design. Let's say I'm a model developer and I want to make my model better at code writing. And it's like, "Okay, but how do I even measure code writing?"

So I need someone who understands code writing to think about what makes it good, and then design the whole dataset and the criteria to evaluate code writing. So yeah, I think there's that. It's more like eval design, which is very interesting work: criteria, guidelines, how to do it, and then training people to do it effectively. So I guess I think evals are really, really fun because they're extremely creative. I was looking at different evals people have built and it was like, "Wow." It's not dry at all. It's just super, super fun.

Lenny RachitskyWe had a whole podcast on evals with Hamel and Shreya. That's exactly what they talked about: it's actually really fun to create evals, especially for companies. So let's dig into that one a little bit more. There's this debate online, and I don't know how big a deal it is, but it feels like people spend a lot of time thinking about it: do we need evals for AI products? Some of the best companies say they don't really do evals; they just go on vibes. They're just like, "Is this working well? Can I feel it or not?" What's your take on the importance of building evals, and the skill of evals, for AI apps, not the model companies?

Chip HuyenYou don't have to be absolutely perfect to win, I think. You just need to be good enough and be consistent about it. Okay, this is not a philosophy I follow, but I have worked with enough companies to see it play out. So when I ask why a company doesn't do evals: let's say you're an executive and you want a new use case. So here's a use case; you started out, built it, and it works well. The customers are somewhat happy. You don't have an exact metric for it.

So the traffic keeps increasing, people seem happy, people keep buying stuff, and now here's your engineer coming in like, "Okay, we need evals for this." And it's like, "Okay, how much effort do we need to put into evals?" And they're like, "Okay, maybe two engineers, this much time." "Okay, so how much expected gain can I get from it?" And the engineer is like, "Oh, maybe you can improve it from 80% to 82%, 85%."
And it's like, "Okay, but with those two engineers we could launch a new feature that could give me so much more improvement." So I think that's one part of it. Sometimes people think of evals as: okay, this is good enough, just don't touch it. If you spend a lot of energy on evals, you only get an incremental improvement, whereas you could spend the energy on another use case, and maybe this one is good enough that you can vibe check it.
So I do think maybe that's what the debate is about. A lot of the time people just get things to a place where it's like, okay, good enough, and they run with it. But of course there's a lot of risk associated with that, because if you don't have a clear metric, you don't have good visibility into how the application or model is performing. It might do something very dumb; something crazy can happen. So yeah, I do think evals are very, very important if you operate at a scale where failures can have catastrophic consequences.
Then you do need to be very rigorous about what you put in front of users: understand the different failure modes, what could go wrong. The same if you're in a space where the feature, the product, is a competitive advantage. You want to be the best at it, so you want a very strong understanding of where you are and where you stand relative to competitors. But if it's something more low-key, okay, it's not the core, but it helps our users,
then maybe you don't need to be so obsessed or theoretical about it. It's like, okay, that's good enough for now, and if it fails, then it fails. Okay, I know that sounds terrifying. But yeah, I think it's all a question of return on investment. I'm a big fan of evals, I love reading evals. And at the same time, I understand why some people would choose not to focus on evals right away and choose to bring out new functionality instead.

Lenny RachitskyAwesome. That is a really pragmatic answer. What I'm hearing is evals are great, very important, especially if you're operating at scale, but pick your battles. You don't need to write evals for every little feature. Something that Hamel and Shreya shared is that people need just, I don't know, five or seven evals for the most important elements of their product. Is that what you see or do you see a lot more in production that people build and need?

Chip HuyenI don't think there's just a fixed number of evals. What's the goal of an eval? The goal of an eval is to guide product development. The reason I'm a big fan of evals is that they help you uncover opportunities, the places where the product is not doing well. Sometimes it's very obvious: you look at the eval and realize, okay, it performs really poorly on this specific segment of users, and then you look into it, like, okay, what's wrong with it? And it turns out we just don't have good messaging for them. So people can focus on the things that are doing poorly and improve them significantly. Yeah, so the number of evals really depends. We have seen products with hundreds of different metrics.

Lenny RachitskyOh, wow.

Chip HuyenPeople go crazy, and it's because that product is general, so it has different needs: one eval for, I don't know, verbosity, one eval for user-sensitive data, another for length. Okay, let me give a good, concrete example: deep research. You have the application, and you use a model to do deep research for you. You have a prompt. Say: okay, do me comprehensive research on Lenny's Podcast and propose a report on what kind of topics he's interested in, what kind of videos get the most views, or what topics he's missing that he should be covering, right? You have that prompt. Then how do you evaluate the result? I don't think there's one metric that would help. I think somebody has a benchmark where they get a hundred experts to write a bunch of prompts and then go through the AI's answers. And that's extremely costly and slow.

But you can do something else. One way I was thinking about it, talking to a friend, is: how would you produce the resulting summary? First, you gather information, and to gather information you run a lot of search queries. You grab the search results, aggregate some of them, and then maybe say, okay, I'm still missing this, so you go another round, and after another round you have the summary. So every step of the way, you need evaluations, not just end to end. Take the search queries as the first step: okay, I wrote five search queries. I might look into how good the search queries are. Are they too similar to each other? Because if the five search queries are very similar, like "Lenny Podcast," "Lenny Podcast last month," "Lenny Podcast two months ago,"
that's not very exciting. But if the keywords in the queries are more diverse, then you look at the results of each search query. Say you enter "Lenny Podcast data labeling" and you get 10 pages, 10 results. Then you enter, I don't know, "Lenny Podcast frontier labs," and you get 10 results, different webpages. Okay, how much do they overlap? Are we getting breadth, a lot of pages, but also depth, and also relevance? Because a search query could be completely irrelevant to the original prompt. So I feel like every aspect of it needs a way of being evaluated. So I don't think it's "how many evals should I have," but how many evals do I need to get good coverage, high confidence in my application's performance, and to help me understand where it is not performing well so that I can fix it.
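
The step-level checks described above, query diversity and result overlap, can be sketched with simple set arithmetic. This is a hedged illustration: Jaccard similarity over tokens and URLs is one easy choice, not the metric any particular deep-research product uses, and the example queries echo the ones in the conversation.

```python
# Sketch of two step-level evals for a deep-research pipeline: how diverse
# the generated search queries are, and how much two queries' result sets
# overlap. Jaccard similarity is one simple choice, used here for illustration.

def jaccard(a: set, b: set) -> float:
    """Set overlap: |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def query_diversity(queries: list[str]) -> float:
    """1 minus the mean pairwise token overlap; higher means more diverse."""
    token_sets = [set(q.lower().split()) for q in queries]
    pairs = [(i, j) for i in range(len(token_sets))
             for j in range(i + 1, len(token_sets))]
    if not pairs:
        return 1.0
    mean_sim = sum(jaccard(token_sets[i], token_sets[j])
                   for i, j in pairs) / len(pairs)
    return 1.0 - mean_sim

def result_overlap(urls_a: list[str], urls_b: list[str]) -> float:
    """Share of result URLs two searches have in common; high overlap means low breadth."""
    return jaccard(set(urls_a), set(urls_b))

# Near-duplicate queries, like the example in the conversation, score low...
narrow = ["lenny podcast", "lenny podcast last month", "lenny podcast this month"]
# ...while queries probing different angles score higher.
broad = ["lenny podcast data labeling", "lenny podcast frontier labs",
         "lenny podcast evals"]
assert query_diversity(narrow) < query_diversity(broad)
```

A real pipeline would likely use embeddings rather than token sets, but the shape of the check, a small scored function per pipeline step, stays the same.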

Lenny RachitskyAwesome. And I'm hearing also just especially for the very core use case, the most common path people take in your product is where you want to focus.

Chip HuyenYeah, yeah.

Lenny RachitskyOkay. There's one more term I want to cover, and then I want to go in a somewhat different direction. RAG? People see this term a lot, R-A-G. What does it mean?

Chip HuyenSo RAG stands for Retrieval-Augmented Generation. The idea is that for a lot of questions, we need context to answer. I think it came from a paper, 2017 I think. They realized that for a bunch of question-answering benchmarks, if you give the model information about the question, the answer can be much, much better. So what they did was retrieve information from Wikipedia: for a question, just retrieve the relevant passage, put it into the context, and answer. It does much better. It sounds like a no-brainer, right? I mean, obviously. So that's what RAG is, in the simplest sense: providing the model with relevant context so that it can answer the questions. And this is where things get more interesting, because traditionally, when it started out, RAG was mostly text.

So we talk a lot about how to prepare data so that the model can retrieve effectively. Not everything is a Wikipedia page. A Wikipedia page is pretty self-contained, and you know everything in it is about one topic. But a lot of the time you have documents with weird structures. Let's say you have documents about Lenny's Podcast, and at the beginning of a document it says, "From now on, 'the podcast' refers to Lenny's Podcast." Now somebody asks, "Okay, tell me about Lenny. Lenny's work." Because the rest of the document doesn't contain the term "Lenny," you might not retrieve it. And if the document is long enough to be chunked into different parts, the second part doesn't have the word "Lenny," so you cannot reach it. So you have to find a way to process the data to make sure you can retrieve the information that's relevant to the query, even when it's not immediately obvious that it's related.
So people come up with ideas like contextual retrieval: giving each chunk of the data relevant context, maybe a summary or metadata, so the retriever knows what it's about. Some people use hypothetical questions, which is very interesting: for each chunk of a document, you generate a bunch of questions that the chunk can help answer, so that when you have a query, you check whether it matches any of the hypothetical questions and fetch the chunk. It's a very interesting approach. Okay, so before I go to the next thing, I just want to say that data preparation for RAG is extremely important. In a lot of the companies I have seen, the biggest performance gains in their RAG solutions come from better data preparation, not from agonizing over which database to use. The database of course matters for things like latency, or if you have very specific access patterns, like read-heavy or write-heavy. But in terms of pure answer quality, I think data preparation is just key.
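
The hypothetical-questions idea mentioned above can be sketched like this. In a real system the questions would be generated by an LLM and matched with embeddings; here the index contents are invented and matching is naive token overlap, purely for illustration.

```python
# Sketch of hypothetical-questions indexing: each chunk is stored alongside
# questions it could answer, and an incoming query is matched against the
# questions rather than the chunk text. Index contents are made up; matching
# via token overlap stands in for an embedding similarity search.

def token_overlap(a: str, b: str) -> int:
    """Number of lowercase whitespace tokens two strings share."""
    return len(set(a.lower().split()) & set(b.lower().split()))

index = [
    {"chunk": "From now on, 'the podcast' refers to Lenny's Podcast...",
     "questions": ["What does Lenny's Podcast cover?",
                   "Who hosts Lenny's Podcast?"]},
    {"chunk": "Maternity leave is 16 weeks at full pay...",
     "questions": ["What is the maternity leave policy?",
                   "How many weeks of parental leave do I get?"]},
]

def retrieve(query: str) -> str:
    """Return the chunk whose hypothetical questions best match the query."""
    best = max(index, key=lambda entry: max(token_overlap(query, q)
                                            for q in entry["questions"]))
    return best["chunk"]

# The query is routed to a chunk via its questions, not the chunk text itself.
print(retrieve("Tell me about Lenny's work"))
```

This is why the technique helps with the "the document never says Lenny" problem: the generated questions can mention what the chunk is about even when its own text does not.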

Lenny RachitskyWhen you say data preparation, what's an example to make that real and concrete for us to understand?

Chip HuyenSo one way, as I mentioned, is that you have chunks of data, and you have to think about how big each chunk should be. Think of it as a context budget you want to maximize. Very simple example: you can retrieve a thousand words. If a chunk is longer, it's more likely to contain the relevant information, but if a chunk is a thousand words, you can only retrieve one chunk, so it's not very useful. If chunks are too short, you can retrieve a wider range of documents and chunks, but at the same time each chunk may be too small to contain the relevant information.
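
The chunk-size trade-off can be made concrete with a toy chunker. This is a minimal sketch; the sizes and overlap below are arbitrary example values, not recommended settings.

```python
# Minimal word-based chunker illustrating the trade-off: with a fixed
# retrieval budget, one big chunk keeps context together but uses the whole
# budget, while small chunks let you pull from many places at the risk of
# cutting context apart. Sizes and overlap here are arbitrary.

def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `chunk_size` words, overlapping by `overlap` words."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

doc = " ".join(f"word{i}" for i in range(1000))  # stand-in 1,000-word document

big = chunk_words(doc, chunk_size=1000)               # one chunk eats the whole budget
small = chunk_words(doc, chunk_size=100, overlap=20)  # many smaller, overlapping chunks
print(len(big), len(small))  # → 1 13
```

Production splitters also respect sentence and section boundaries rather than raw word counts, but the budget arithmetic is the same.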

So you have to design the chunks nicely, how big each chunk should be, and you add contextual information like summaries, metadata, hypothetical questions. Somebody was telling me that a very big performance gain they got came from rewriting their data in question-answering format. So instead of just chunking a podcast, you reframe it, rewrite it into "here's a question, here's the answer," and produce a lot of those. You can use AI for that as well. So that's one example of data processing. Another example I see is people using AI to help with specific tools and documentation. A lot of documentation today is written for human reading, and AI reading is different, because humans have common sense, and even human experts have context that AI doesn't quite have.
So somebody told me that a big change for them was this: say you have a function, and the library's documentation says the output is, I don't know, some temperature or score, and it should be one, zero, or minus one. A human expert understands the scale and what a one means, but the AI just doesn't understand what that means. So they actually add another annotation layer for AI: okay, a temperature equal to one means this; it's not an actual temperature, it's a position on this scale. So all this data processing is to make it easier for the AI to retrieve the relevant information to answer the questions.
Lenny RachitskyAwesome. Okay. So you've talked a bit about how you work with companies on these sorts of things: their AI strategies, their AI products, how they build, which tools they use, all of that. I want to spend a little time here, because a lot of companies are building AI products, and a lot of companies are not having a good time building AI products. Let me ask a few questions about what you've learned working with companies that are doing this well. One is just about AI tool adoption within companies in general. There's all this talk recently of AI hype, and the data is actually showing most companies try it, it doesn't do a lot, they stop. And so there's all this "maybe this isn't going anywhere." So in terms of adoption of AI tools within companies, what are you seeing?

Chip HuyenFor GenAI in companies, I've seen two types of GenAI tooling. One is for internal productivity: coding tools, Slack chatbots, internal knowledge. A lot of big enterprises have some wrapper around models, with access to maybe some type of RAG solution. We talked about text-based RAG; we haven't talked about agentic RAG or multimodal RAG yet, but yes, there's a whole very exciting area around that. Basically, it allows employees to access internal documents. Somebody asks: okay, I'm having a baby, what's the maternity or paternity policy? Or, I'm having this operation, does the health benefit cover it? Or, I want to refer my friend for an interview, what's the process for that? So a lot of companies have internal chatbots to help with internal operations.

And the other category is more customer-facing or partner-facing. Customer support chatbots are a big one. If you're a hotel chain, you might have a booking chatbot, and that category is somehow massive; there are a lot of booking chatbots. I have this theory that companies pursue the applications where they can measure a concrete outcome. And booking or sales chatbots are very clear: there's the conversion rate today with human operators, and there's the conversion rate with the chatbot. The outcomes are very clear, so companies find it easier to buy into these solutions. So a lot of companies have that customer-facing chatbot.
So that's the other category of tool. For customer-facing or external-facing tools, people are driven to choose applications with clear outcomes, so the question of adopting them is really based on whether they see the outcome or not. Of course, it's not perfect, because sometimes the outcome can be bad not because the application's idea is bad, but because the process of building it wasn't great. So it's tricky. For internal adoption of tooling, internal productivity, that's where it gets tricky. A lot of companies think about AI strategy, and I think AI strategies usually have two key aspects: use cases and talent. You might have great data for great use cases, but if you don't have the talent, you cannot do it.
So a lot of the time, at the beginning with GenAI, and I really admire a lot of companies for this, they said: okay, we need our employees to be very GenAI-aware, very AI-literate. So they start adopting a bunch of tools for the teams to use, they run a lot of upskilling workshops, they encourage learning, and that's a really, really good thing. They're also willing to spend a lot of money on adoption, giving people ChatGPT subscriptions, purchasing subscriptions, to get employees to be more AI-literate. And the thing is, a lot of them then say: okay, we spent a ton of money on this tooling, but when we look at the usage, people don't seem to use the tools as much, so what is the issue? So yeah, I think that is tricky. Yeah.

Lenny RachitskyWhat do you think is the issue? Is it just they don't know how to use them? What do you think is the gap here? Do you think we'll get to a place of just like, wow, work is completely different because of AI for a lot of companies?

Chip HuyenThe main thing is, again, it's really hard to measure productivity. So I talk to a lot of people about this. First of all, there's coding: a lot of companies are now using coding agents and coding tools. And I was asking, "Do you think it helps with your productivity?" And a lot of times, the answers are like, okay, I feel like it's better, because we have more PRs, we see more code. But of course, the number of lines of code is not a good metric for that. So it's really, really tricky, and here's something funny. I do ask people to ask their managers. I usually work with VP-level people, so they have multiple teams under them. So I ask them: ask your managers, okay, would you rather have access...

Would you rather give everyone on the team very expensive coding agent subscriptions, or get an extra headcount? Almost every one of the managers would say headcount. But if you ask VP-level people, or someone who manages a lot of teams, they would say: give me the AI assistant tools. And the reason is that as a manager, you are still growing. You're not at a level where you manage hundreds or thousands of people, so for you, having one extra headcount is big. You want it not for productivity reasons, but because you just want more people working for you. Whereas as an executive, maybe you have more business metrics that you care about, so you actually think about what actually drives productivity for you. So it is tricky, this question of productivity. I'm not sure whether people fundamentally are more productive or not; it's just that we don't have a good way of measuring the productivity improvement.
Another thing is also very interesting. People tell me that they notice different buckets of employees having different reactions to AI assistant tools. I keep going back to coding because coding is big and somehow easier to reason about. And I get different reports. One person told me that among his engineers, he thinks the senior engineers get the most out of it, are the most productive with it. That person is very interesting: he actually divided his team into three buckets, without telling them, obviously. It was like, okay, here are the currently best performing, average performing, and lowest performing. And then he ran a randomized trial: they gave half of each group access to Cursor. And over time he noticed something funny. The group that got the biggest performance boost, in his opinion, and he was very close to his team,
was the senior engineers, the highest performing. So the highest-performing engineers got the biggest boost out of it, and the second group was the average performing. His opinion is that the highest-performing engineers also know the good practices and know how to solve problems, so they can use it to solve problems better. Whereas the lowest-performing people often don't care much about work, so it's easier to just go on autopilot, have it generate the code, and ship it, or they just don't know how to do it. Another company, however, told me that actually, their senior engineers are the ones most resistant to using AI assistant tools, because they're more opinionated and have very high standards. It's like, "AI code just sucks." So they're very, very resistant to using it. So I don't know; I haven't quite been able to reconcile the very different reports on that yet.

Lenny RachitskyThis is so interesting. So just to make sure I'm hearing the story right: there's a company you work with that did a three-bucket test with their engineering team, where they created three groups, the highest-performing engineers, mid-performing engineers, and lowest-performing engineers, and gave some of them access to, say, Cursor. Was it Cursor, or what did they give them access to? It was Cursor, right?

Chip HuyenI think it was Cursor.

Lenny RachitskyOkay, cool. And so within-

Chip HuyenI didn't work with them. This is more like a friend company.

Lenny RachitskyOkay. It's a friend's company.

Chip HuyenYeah.

Lenny RachitskySo did they give half of the higher performing engineers Cursor and half not or how did they do the split there?

Chip HuyenYeah, so they gave it to half of the entire team, but half of each bucket. Yeah.

Lenny RachitskyWhoa.

Chip HuyenAnd then they observe the difference in productivity.

Lenny RachitskyI see. So how do they even do that? They're just like, "Okay, you get Cursor, you don't get Cursor." How did they do that? That's so interesting.

Chip HuyenYeah, I didn't get into the mechanics of it, but I was like, "I respect you for doing a randomized trial on that."

Lenny RachitskyThat is so cool.

Chip HuyenYeah. Yeah.

Lenny RachitskyOkay. Wow. How large was this engineering team? Was it like hundreds of people?

Chip HuyenIt's not that large. It's about maybe 30 to maybe 40. Yeah.

Lenny Rachitsky30 to 40. Okay.

Chip HuyenYeah.

Lenny RachitskyWow. Okay. So they found that the highest-performing engineers got the most benefit from using AI tools, then behind them the middle-tier engineers, and then the lowest performers. Okay.

Chip HuyenBut it's also not the same everywhere.

Lenny RachitskyRight. Right. Right, right.

Chip HuyenSome companies are different.

Lenny RachitskyRight. This other example you shared, of senior engineers in this one case being the most resistant to changing the way they work, which I get. I do feel like the most valuable people right now, other than ML researchers and AI researchers like yourself, are senior engineers, because so much of what junior engineers do is now done by AI. But an engineer who knows what they're doing, who understands how things work at a large scale, with AI tools, basically infinite junior engineers doing their bidding, feels like an extremely valuable and powerful asset.

Chip HuyenYeah, I've definitely seen companies appreciate engineers who have a good understanding of the whole system and good problem-solving skills, who think holistically instead of locally. One company told me the way they work is completely different now. They actually restructured the engineering org so that the senior engineers are more in the peer-review role: they write guidelines on what good engineering practice is, what the process should look like.

So they write a lot of the processes for how to work well, and then the more junior engineers produce code and submit PRs, while the senior engineers are more in the reviewing role. So I think they might be preparing for the future. Another company actually told me something very similar: they're preparing for a future where they only need a very small group of very, very strong engineers to create processes and review code before it goes into production, while AI or junior engineers produce the code. But then the question becomes: how does one become a very strong senior engineer?

Lenny RachitskyRight. That's right. That's right. That's the problem. Yeah.

Chip HuyenYeah. So I don't know what the process would be. That's what I was thinking about, yeah.

Lenny RachitskyNo one's thinking about it. It's a problem. We won't have any more in 10, 20 years. There'll be no more senior engineers because no one's hiring junior engineers. Although I could make the opposite case: junior engineers, people just getting into computer science right now, are AI-native. And in theory, you could argue they will become really good really fast if they're curious, if they aren't just delegating learning and thinking to AI but using it to learn how to code well and architect correctly. You could argue they'll be the most successful engineers in the future.

Chip HuyenI do think what you mentioned about architecture relates to what I group under systems thinking. I do think it's a very important skill, because AI can help automate a lot of disjointed skills, but knowing how to put those skills together to solve problems is hard. There was a webinar with Mehran Sahami, who is one of my favorite professors. He was the chair of the curriculum at the CS department at Stanford, so he spent a lot of time thinking about CS education and what students should learn nowadays in the era of AI coding. The other person was Andrew Ng, who is of course a legend in the AI space. And Professor Sahami said something very interesting. He said a lot of people think that CS is about coding, but it's not. Coding is just a means to an end.

CS is about systems thinking, using coding to solve actual problems, and problem solving will never go away, because as AI automates more stuff, the problems just get bigger. But the process of understanding what caused an issue and designing a step-by-step solution to it will always be there. As an example, I actually have a lot of issues with the way AI debugs. I'm not sure how much you use AI for coding, but something I have noticed, and also seen from my friends, is that it's pretty good when you have a very clear, well-defined task: write documentation, fix a specific feature, or build an app from scratch that doesn't have to interact with a large existing code base. But the moment you add something a little more complicated, maybe requiring interaction with other components, it's usually not that good.
For example, I was using AI to deploy an application, and I was testing out a new hosting service I was not familiar with. Usually, one thing working with AI does give me is the confidence to try a new tool. Before AI, trying a new tool meant reading the documentation from the beginning; now it's like, okay, just try it out and learn. So I was testing out this new hosting service and it kept hitting a bug, which was very, very annoying. I asked the AI to fix it, and it kept changing things: change the environment variable, fix the code, change from this function to that function, change the language, maybe it doesn't process JavaScript, I don't know, whatever. And it didn't work. And I was like, okay, that's it.
I'm just going to read the documentation myself and see what's wrong. And it turns out I was on another tier, and the feature I wanted is not available on that tier, right? So the issue was that the AI kept trying to fix things in one component while the issue came from a different component. Understanding how different components work together, and where the source of an issue might come from, requires a holistic view. And it made me think: okay, how do we teach AI systems thinking? Human experts have all this scaffolding, like, for this kind of problem, look into this, look into that. So that could be one way. But it also made me think: how do we teach humans systems thinking? Yeah. So I think it's a very interesting skill, and I do think it's very important.

Lenny RachitskyThat's exactly the same insight Bret Taylor shared on the podcast. He's the co-founder of Sierra, he created Google Maps, he was co-CEO of Salesforce, CEO of Quip, a few other things. And I asked him, should people learn to code? And his point is exactly what you said: taking computer science classes is not about learning Java and Python. It's learning how systems work, how code operates, and how software works broadly, not just "here's a function to do a thing."

One thing that I wanted to help people understand: you wrote this book called AI Engineering, which is essentially helping people understand this new genre of engineer, and you have this really simple way of thinking about the difference between an ML engineer and an AI engineer, which has a really good corollary for product managers now: an AI product manager versus a non-AI product manager. The way you describe it, and fill in what I'm missing, is that ML engineers build models themselves, while AI engineers use existing models to build products. Anything you want to add there?

Chip HuyenOne thing I really dislike about writing books is that you have to define things, and no definition will be perfect because there will always be edge cases. But in general, I think of it as GenAI as a service: somebody else builds the models for you, and the base model performance is pretty good. It has enabled people to say, okay, now I want to integrate AI into my product, and I don't need to learn how the model works, even though knowing that could really help. It makes the entry barrier really low for people who want to use AI to build products, and at the same time, AI capabilities are so strong that they have also increased the possibilities, the types of applications AI can be used for. So the entry barrier is super low and the demand for AI applications is a lot bigger. It's very, very exciting. It opens up a whole new world of possibilities.

Lenny RachitskyYeah. Now you don't have to spend time building this AI brain; you can just use it to do stuff. Such an unlock. Okay, maybe just a final question. You get to see a lot of what's working, what's not working, where things are heading. I'm curious: if you had to think about the next two or three years, how do you think building products will be different? How will the way companies work be different? What's the biggest change you expect to see in how companies work?

Chip HuyenI think a lot of organizations don't move that fast, but at the same time, they move faster than I expected. Again, I'm biased: I don't work with dinosaur companies that don't care, and the executives who come to me are very forward-looking, so I'm biased toward organizations that move fast. One big change I see is in organizational structure. Before, we had a lot of disjointed teams: a very clear engineering team, a product team. But then there's the question of who should write evals, who should own the metrics. And it turns out eval is not a separate problem. It's a systems problem, because you need to look into different components and how they interact with each other, and you need user behavior, because you need to know what users care about so you can write evals that reflect what users care about.

Some of that you can sort out by looking into different components and architectures and placing guardrails; that's engineering. But understanding users is product. Because of all that, and because eval is extremely important, it's bringing the product team, the engineering team, even marketing teams like user acquisition, very close to each other. So in some ways people are restructuring so that there's more communication between previously very distinct functions. Another thing I see is teams thinking about what can be automated in the next few years and what cannot. It's a little bit scary to think about, but teams have told me, okay, this stays between you and me, but we have gotten rid of these functions, a lot of things that were previously outsourced, for example.
Traditionally that's business process outsourcing: work that's not core to them and can be systematized, so you can actually use AI to automate a lot of it. Separately, people are thinking more about what the value of junior engineers versus senior engineers is, and how to restructure the engineering org around that. So in successful organizations, people are moving pieces around and thinking about use cases, whether they need to spin out new use cases, and who would lead a new effort. That's one big change. Another thing, in terms of AI itself: I'm not sure how true this is, but I'm in the camp, and I think it has merit, that believes base models are probably not quite maxed out, but we're unlikely to see crazily stronger models.
Remember when we had GPT? Then GPT-2, which was a big step up over GPT, then GPT-3, much, much bigger, then GPT-4, much, much bigger. And then of course GPT-5, but whether GPT-5 was that same scale of step up compared to the previous model, I think that's debatable. So at the risk of disappointment, I think base model performance improvement is not going to be as mind-blowing as it was in the last three years. I see a lot of improvement in the post-training phase and in the application-building phase, and that's where I expect to see a lot of progress. I'm also very interested in multimodality. We've seen a lot of text-based use cases, but I think there are a lot of audio and video use cases that are very, very exciting.
And I think audio is not quite solved. I work with a couple of voice startups, and when it comes to voice, it's an entirely different beast. Say you go from a text chatbot to a voice chatbot. The constraints are completely different, because with a voice chatbot you need to think about latency. There are multiple steps: first voice to text, then text to text, turning the text question into a text answer, and then text to voice for the answer. You have multiple hops, and latency becomes very important. And there's the question of what makes it sound natural. For example, when humans talk to each other, if you try to interrupt me and say, "Chip," I would pause and try to hear you out.
But sometimes if you just say a word to acknowledge me, like "mm-hmm, mm-hmm," I shouldn't stop; I should just continue. So the question of interruption, whether I should stop or not, is a big part of what's perceived as natural conversation. And there's also regulation, because a lot of the time people want to build voice chatbots that sound like humans, that try to trick users into thinking they're talking to humans, but there's potential regulation saying you have to disclose to users whether the bot they're talking to is human or AI. So this whole space is not quite as solved as you'd think. But it's not quite a foundation model problem either, because human interruption detection is actually a classical machine learning problem.
It's a different framing, but you can build a classifier for that. And the question of latency is actually a massive engineering challenge, not an AI challenge. Of course, it can become an AI challenge, because people are trying to build voice-to-voice models: instead of first transcribing my voice into text, then getting a text answer from one model, then having another model turn the text into speech, you can just do voice-to-voice directly. People are working on that, but it's very hard. And even so, I think audio is easier than video, because video has both image and voice, and audio alone is already pretty hard. So there are a lot of challenges in that space.
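The multi-hop pipeline Chip describes can be sketched as three stages whose latencies add up, which is why every hop matters. This is a minimal illustration with stubbed-out stages; the function names and latency numbers are hypothetical, not from any real service:

```python
import time

def speech_to_text(audio: bytes) -> str:
    # Stub: a real ASR (speech recognition) call would go here.
    time.sleep(0.05)  # simulate ~50 ms of transcription latency
    return "what is the capital of France"

def text_to_text(question: str) -> str:
    # Stub: a real LLM call would go here.
    time.sleep(0.10)  # simulate ~100 ms of generation latency
    return "The capital of France is Paris."

def text_to_speech(answer: str) -> bytes:
    # Stub: a real TTS (speech synthesis) call would go here.
    time.sleep(0.05)  # simulate ~50 ms of synthesis latency
    return answer.encode()

def voice_chatbot_turn(audio: bytes) -> tuple[bytes, float]:
    """Run one user turn through all three hops and report total latency."""
    start = time.perf_counter()
    text = speech_to_text(audio)      # hop 1: voice -> text
    answer = text_to_text(text)       # hop 2: text question -> text answer
    speech = text_to_speech(answer)   # hop 3: text -> voice
    return speech, time.perf_counter() - start

reply, latency = voice_chatbot_turn(b"...")
print(f"total latency: {latency:.2f}s")  # roughly the sum of the three hops
```

The voice-to-voice models Chip mentions collapse all three stubs into a single model call, which is exactly what removes the additive latency.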

Lenny RachitskyThat was an awesome list. Let me mirror it back real quick. What you're predicting will change in the way we work over the next few years, and these resonate with so many conversations I've had on this podcast, so it's doubling down on where things are heading. One is the blurring of lines between different functions, not just design and engineering; everyone's going to be doing a lot of different things now. Two is more work being automated with agents and all these AI tools, and, in theory, productivity going up. Third is a shift from pre-training models to post-training, fine-tuning, and things like that, because, to your point, models may be slowing down in how smart they're getting.

Although I'll point folks to my chat with the co-founder of Anthropic. He made a really good point: we're really bad at understanding what exponentials feel like when we're in the middle of one. And also, models are being released more often, so we may not notice the difference between them, because releases are just happening more frequently, versus GPT-3 coming out a year after GPT-2. Maybe true, maybe not. And the fourth point you made is investing in multimodal experiences. I cannot wait for ChatGPT voice mode to get better at interruption, exactly what you're saying. I'm just talking to it, then someone makes a little sound, and it stops talking. It's so annoying.

Chip HuyenI'm shocked that we don't have better voice assistants at home yet. I've been testing out a bunch, honestly. I keep hoping, oh my God, this could be the one, and then I can't tell you how many of them I've had to give away because they're not that good.

Lenny RachitskyI think it's coming. I hear it's coming. Anthropic's working with someone on that; I don't know if it's launched yet.

Chip HuyenYeah, I want to come back to what your guest from Anthropic mentioned about performance improvement. I think there's a big difference between a model's base capability, meaning the pre-trained model, and its perceived performance. Are you familiar with the term test-time compute?

Lenny RachitskyI don't think so. Help us understand.

Chip HuyenThe idea is this: you have a fixed amount of compute. You spend a lot of compute on pre-training the model, then some compute on fine-tuning, and the ratio of pre-training to post-training compute varies a lot between labs. And then you also spend compute on inference: once you have a trained, fine-tuned model and want to serve it to users, I type a question in a prompt, the model generates, does inference, and that requires compute. So there's a discussion of whether to spend more compute on pre-training, fine-tuning, or inference. Spending more compute on inference is what's called test-time compute: a strategy of allocating more compute resources to inference, which should bring better performance. And how does that work?

Say you have a math question. Instead of generating just one answer, you generate four different answers and pick whichever is best according to some standard. Or you have four answers, and maybe three of them say 42 and one says 20. You say, okay, three of them are in agreement, so the answer should be 42. So you just generate a bunch of them. Another approach is reasoning, thinking: the model generates more thinking tokens, spends more time thinking before showing the final answer. It requires more compute but also gives better performance. So from the user's perspective, when the model spends more time exploring different potential answers and thinking longer, it can give you a much better final answer, but the base model itself does not change.
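The majority-vote idea Chip describes (often called self-consistency) can be sketched in a few lines. Here `sample_answer` is a hypothetical stand-in for one sampled model completion, rigged to reproduce her three-say-42, one-says-20 example:

```python
import random
from collections import Counter

random.seed(0)  # make the toy example deterministic

def sample_answer(question: str) -> str:
    # Stand-in for one model completion sampled at non-zero temperature;
    # in Chip's example, three out of four samples say "42" and one says "20".
    return random.choice(["42", "42", "42", "20"])

def majority_vote(question: str, n_samples: int = 4) -> str:
    """Spend more inference-time compute: sample several answers
    and return the most frequent one."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

print(majority_vote("What is 6 x 7?", n_samples=101))
```

More samples cost proportionally more compute but make the vote more reliable, which is the test-time-compute trade-off in miniature.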

Lenny RachitskyAwesome.

Chip HuyenDoes it make sense?

Lenny RachitskyYes, that does. Absolutely.

Chip HuyenYeah?

Lenny RachitskyThat is a good corollary to Ben Mann's point.

Chip HuyenYeah.

Lenny RachitskyChip, we covered a lot of ground. I've gone through everything I was hoping to learn and more. Before we get to a very exciting lightning round, is there anything else that you wanted to share? Anything else you want to leave listeners with?

Chip HuyenI work with a few companies that do this thing where they want employees to come up with ideas. There's a big debate about the better approach to AI strategy: should it be top-down or bottom-up? Should executives come up with one or two killer use cases and have everyone allocate resources to those, or should you let engineers and PMs and smart people come up with ideas? I think it's a mixture of both. Some companies say, okay, we hired a bunch of smart people, let's see what they come up with, and they organize hackathons or internal challenges to get people to build products. And one thing I noticed is that a lot of people just don't know what to build. And it shocked me. I feel like we're in some kind of idea crisis, right?

Now we have all these really cool tools. You can do everything from scratch: they can do design for you, they can write code, they can build a website. So in theory we should see a lot more, but at the same time, people are somehow stuck. They don't know what to build. I think a lot of it may have to do with societal expectations, because we've gone into this phase of specialization: people are very highly specialized and are supposed to focus on doing one thing really well instead of seeing the big picture. And without a big-picture view, it's hard to come up with ideas for what to build.
So when I work with a company on a hackathon, we work on a guideline for how to come up with ideas. Usually one tip is: look at your last week. For a week, just pay attention to what you do and what frustrates you. And when something frustrates you, think: is there anything we can do? Can it be done a different way so it's not frustrating? People can even swap frustrations across teams; maybe they have common frustrations, and there's something you can build around that. So just notice how you work, think of ways to improve it, constantly ask how this could be better, and then build something to address the frustration. I think it's a good way to learn and adopt AI.

Chip HuyenOh, I would love to see that. I'm very bullish on using AI to create micro tools: little things that just make your life a bit easier.

Lenny RachitskyA hundred percent. I feel like that's one of the main ways people are using these tools, just a little niche problem they have. With that, Chip, we've reached our very exciting lightning round. I've got five questions for you. Are you ready?

Chip HuyenYeah, always. No, no, no. It depends on how hard the questions are.

Lenny RachitskyThey're very consistent across every guest. So I imagine you've heard them before. First question, what are two or three books that you find yourself recommending most to other people?

Chip HuyenI'm really terrified of book recommendations, because I feel like what books you should read really depends on what you want, where you are in life, and where you want to get to. But there are several books that I think really changed the way I think and see the world. One is The Selfish Gene. It actually helped me with the question of whether I want to have kids or not, because it helped me understand that a lot of how we function, the way we operate, is a function of our genes, and genes want to do one thing: procreate.

In a way, the book also proposes something else: everyone wants to live forever. Maybe not consciously, but subconsciously, we do want that. And there are two ways. One is via genes; genes want to continue forever. The other is via ideas: if you put some idea out there and it lasts for a long time, it lives on. I know it's a little abstract, but I thought it was very interesting.
The other book I really, really like is by Singapore's former prime minister, the founding father of Singapore, Lee Kuan Yew. I wasn't sure of the exact title at first, but he's the one who transformed Singapore from a Third World country into a First World country within 25 years. And I have never seen any country's leader put so much effort into writing down his thinking on how to build a country like that.
He talks a lot about public policy, how to create policies that encourage people to do the right things for the nation, and also about foreign affairs, foreign policy, and so on. It's a really good book to think with. For me, it's systems thinking, but on a different kind of system: a country, which most of us never get a chance to experiment with in our lives. So it's good to learn about that.

Lenny RachitskyWhat was the name of that second book?

Chip HuyenIt's called From Third World to First. Actually, I think I have it somewhere here. Yeah.

Lenny RachitskyThere it is.

Chip HuyenIt's a very heavy book.

Lenny RachitskyShow and tell.

Chip HuyenYeah.

Lenny RachitskyThat's awesome. I definitely want to read that. That's a really good pick. I've heard a lot about the impact he had, and I've seen all these videos on Twitter of his really wise insights into how to build a thriving society. And clearly, it worked.

Chip HuyenYeah. Can you believe it? How did he find the time to write such a thick book? It's insane.

Lenny RachitskyIt is. Claude, please summarize. I'm just joking. By the way, The Selfish Gene, I also absolutely love that book. Such a good choice. It's such an under-the-radar kind of book that really changed the way I see the world as well. Really good pick. Okay, next question. Do you have a favorite recent movie or TV show you really enjoyed?

Chip HuyenI watch a lot of movies and TV shows as research, because I'm working on my first novel, and I recently sold it. I'm interested in what makes... it's a drama, not science fiction or anything tech people usually read. I know it's very out of left field. So it's almost like watching TV to see what kinds of stories become popular, trying to understand the tropes and so on. I'm not sure if the audience will like...

Lenny RachitskyWell, what's one? What's one that taught you something about writing?

Chip HuyenI'd say Yanxi Palace. It's a Chinese TV show.

Lenny RachitskyCool. Okay. I haven't heard that one on the podcast before. Okay, cool.

Chip HuyenYeah.

Lenny RachitskyNext question. Do you have a life motto that you often think about, come back to when you're dealing with something hard, whether it's in work or in life?

Chip HuyenThis sounds very nihilistic, but I'd say: in the end, nothing really matters. I usually think that in the grand scheme of things, in a billion years, no one will be there. Someone will argue with me about that, but my theory is that in a billion years, none of us will exist. So whatever messy, crazy things we do, or however badly we do them, no one will be there to remember. In a way it sounds scary, but it's very liberating, because it allows me to say, okay, let's just try things out, right? Why does it matter? And there's a story: recently, a family member of mine passed away, and I was talking to my dad because I couldn't be home for it.

I was asking my dad, "Is there anything I can do to make the person more comfortable? Anything I can get him?" And my dad just said, "What could he possibly want at this moment?" It made me feel that at the end of life, there's nothing material that can bring you joy. No money, no product, nothing. And in a way, it makes me ask: what do I really care about at the end of the day? So when I think, okay, maybe I fail, maybe I don't get that contract, at the end of life, I don't think that actually matters. So in a way, it's quite liberating.

Lenny RachitskyI know you said it might be nihilistic, but this is what Steve Jobs shared too, in one of his most famous speeches: we all die someday, so don't take things so seriously, and it is freeing. Absolutely. It makes you appreciate every moment, every day you have. Just like, yeah, let's do something hard and scary. Okay, final question. You talked about how you're writing a novel. Most people in tech have never written something creative, fiction. What's one thing you learned in the process about how to write better stories, better fiction?

Chip HuyenA lot of the time when we read, we get tripped up by small things. I wanted to do creative writing because I want to become a better writer, and I figured trying a different audience could help me get better at anticipating what a different type of audience wants to hear and what they care about. I think writing, or any kind of content creation, is about predicting the audience's reactions, right?

Lenny RachitskyThe next token.

Chip HuyenYou do a podcast.

Lenny RachitskyJust kidding.

Chip HuyenYeah. So you do a podcast, and it's like, okay, what kinds of things will listeners find engaging, right? It's a bit like how a lot of companies, when they launch a product, have a narrative and ask, how do we position this product in a way that users would want? I've done technical writing for a while, and I felt I had some experience predicting what engineers would want to hear or care about. But I don't have any experience with this completely different type of audience. That's why I wanted to do creative writing, to write a story. And that's why I was doing a lot of research, which I enjoyed a lot: watching a lot of dramas, just seeing what people like. One thing I came to care about is the emotional journey; I learned that from an editor.

When we write something, we care about how readers will feel across the story. We want something at the beginning, a hook, so that people continue reading. But we also don't want too much drama, because readers get too tired; they're emotionally exhausted, because being emotionally manipulated a lot is tiring. So you shape an emotional journey: maybe a climax, then something more chill. Another thing I didn't realize: in technical writing, you focus entirely on the content, the argument. It's very impersonal. People who like ML compilers don't care whether they like the person telling them about compilers, because it's just objective. But for a novel, people care about character likability.
In the first version of my story, I made the main character very logical, very rational, doing everything rationally. And the feedback I got, from a very good friend who read it, and he's an amazing person, was, "Chip, I'll be honest, I hate that person." It doesn't matter how good the story is; if the character is that unlikable, he doesn't want to continue. So in the second version, I made the character more likable. The way you make a character more likable is to put in some vulnerability: maybe the person has setbacks, because we can relate to that. So in a lot of ways, it's very interesting. A lot of it is about understanding the emotional side, how readers feel, not just about the story but also about the characters.

Lenny RachitskyThat is so interesting. Wow, I learned a lot more there than I expected. That was awesome, a really good example. Chip, two final questions. Where can folks find you online if they want to reach out, maybe work with you, and share the stuff that you offer? And how can listeners be useful to you?

Chip HuyenI'm on social media, LinkedIn, Twitter. I don't post a lot, but I keep telling myself I should do more, because I like the conversations with readers. I'm actually about to start a Substack. I have a placeholder for it right now, and I'm thinking of focusing it on systems thinking, because I think it's a very interesting skill. I'm also thinking of doing a YouTube channel of book reviews, basically books that help you think better. The first book I review will probably be this one, because it was my favorite book growing up and I keep rereading it. So how can you be helpful? Send me books that you like, books that have changed the way you think or the way you do anything. I would appreciate it.

Lenny RachitskyAmazing. I'm excited to read that book.

Chip HuyenMm-hmm.

Lenny RachitskyChip, thank you so much for being here.

Chip HuyenThank you so much, Lenny, for having me.

Lenny RachitskyBye everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.

Section 02 / 09

Chip HuyenA question I get asked a lot is, "How do we keep up to date with the latest AI news?" But why do you need to keep up to date with the latest AI news? If you talk to the users who understand what they want and don't want, and look into the feedback, you can actually improve the application way, way more.

Lenny RachitskyA lot of companies are building AI products. A lot of companies are not having a good time building AI products.

Chip HuyenWe are in an idea crisis. We now have all these really cool tools to do everything from scratch: they can design for you, write code for you, build you a website. So in theory we should see a lot more, but at the same time, people are somehow stuck. They don't know what to build.

Lenny RachitskyFor all this AI hype, the data actually shows that most companies try it, it doesn't do much, and they stop. What do you think the gap is here?

Chip HuyenIt's really hard to measure productivity. So I ask people to ask their managers: would you rather give everyone on the team a very expensive coding agent subscription, or get an extra head count? Almost every time, the managers say head count. But if you ask someone at the VP level, someone who manages a lot of teams, they say they want the AI assistant. Because as a manager you're still growing your team, so one extra head count is a big deal for you; whereas as an executive you care about broader business metrics, so you think about what actually drives your productivity metrics.

Lenny RachitskyToday's guest is Chip Huyen. Unlike many people who only talk about how to build great AI products and where things are heading, Chip has actually built multiple successful AI products, platforms, and tools. She was a core developer of NVIDIA's NeMo platform, an AI researcher at Netflix, and taught machine learning at Stanford. She is a two-time founder and the author of two of the most popular books in the AI world; the most recent, AI Engineering, has been the most-read book on the O'Reilly platform since it launched.

She has also worked on AI strategy with many enterprises, so she sees what is actually happening on the ground inside very different companies. We cover a lot of fundamentals: what pre-training and post-training each are, what RAG is, what reinforcement learning is, and what RLHF is. We also dig into everything she has learned about building great AI products, including what people think they need versus what they actually need. We talk about the most common mistakes companies make, where she sees the biggest productivity gains coming from, and much more.
This episode is brought to you by Dscout. Design teams today have to move fast and get it right, and that's where Dscout comes in. Dscout is an all-in-one research platform for modern product and design teams. Whether you're running usability tests, interviews, surveys, or field studies, it helps you reach real users and get real insights fast. You can even test Figma prototypes right in the platform. No stitching together different tools, no chasing ghost participants. And with the industry's most trusted participant pool plus AI-powered analysis, your team can build better products with more clarity and confidence, without slowing down. For smoother research, faster decisions, and more impactful design, check out dscout.com. That's D-S-C-O-U-T.com, where the answers you need are waiting.

Chip HuyenHi, Lenny. I've been listening to this podcast for a long time, and I'm really excited to be here. Thank you for having me.

Lenny RachitskyI want to start with a table you shared on LinkedIn a while back. It went really viral, and I think that's because it hit a nerve. Let me read it out; folks watching on YouTube can see it too. It's a very simple table of things people think will improve their AI application versus things that actually improve it. Things people think will improve their AI application: keeping up with the latest AI news, adopting the latest agentic framework, agonizing over which vector database to use, constantly evaluating which model is smarter, fine-tuning models. Things that actually improve it: talking to users, building a more reliable platform, preparing better data, optimizing the end-to-end workflow, writing better prompts. Why do you think this resonated so much? And if you had to sum it up, what do you think people most often miss when building successful AI applications?

Chip HuyenThis question comes up all the time. People keep asking, "How do I keep up with the latest AI news?" And I think: why do you need to keep up with the latest AI news? I know it sounds counterintuitive, but there's just so much news out there. A lot of people also ask me things like: how should I choose between two technologies? For example, recently, MCP versus the agent-to-agent protocol: which one is better? I think the real questions to ask are: first, how much improvement do you actually get from the optimal option over the suboptimal one? Sometimes they'll admit the improvement isn't much.

Then I'll say: if the improvement isn't big, why spend so much time debating something that doesn't affect performance that much? Next they'll ask: if we adopt a new technology, how hard would it be to switch to something else later? Sometimes they'll say, oh, I think the switching cost would be very high. So then I think: here's a new technology that hasn't been validated by many people, and once you adopt it you might be locked in forever. Do you really still want to adopt it? For new technologies that haven't been well validated, maybe you should think twice before over-committing too early.

Lenny RachitskyI love that your bigger-picture advice is actually very simple: to build successful AI applications, talk to users, get your data right, write good prompts, and optimize the user experience, instead of obsessing over what's newest and best, which model to use, or what just happened in AI. Let me follow the thread you mentioned about fine-tuning, that is, post-training. There's a pile of jargon in AI that people hear all the time but don't necessarily understand. I think this is a great chance to help people get clear on what we're actually talking about, since you're someone who builds these things; you've built these systems and worked on them with many companies. I'll sprinkle in a few terms, but let's start with this one. At the simplest level, what's the difference between pre-training and post-training? And where does fine-tuning fit in? What is fine-tuning, exactly?

Chip HuyenA caveat first: I can't fully see what the secretive frontier labs are doing. But from what I hear, it goes roughly like this. One approach is supervised fine-tuning, where you have demonstration data: a group of experts provides examples, like, "Here is a prompt, and here is how it should be answered." You train the model to imitate the answers a human expert would give. That's what a lot of people want, so open-source models often do this via distillation. That is, instead of having human experts write great answers to prompts, you have well-known, strong models generate the answers first, then use those answers to train a smaller model to imitate them.

Sometimes you see people say... I'm actually really grateful to the open-source community. But getting from "we can train a model that imitates an existing good model" to "we can train a good model ourselves" is a completely different thing. The former is already a big step forward. So yes, that's supervised fine-tuning. The other big thing, and I'm not sure whether any of your previous guests have talked about it, is that reinforcement learning is now everywhere.

Lenny RachitskyLet's pause on that in a second; I definitely want to dig into it, because it's come up more and more in my recent conversations and it's fascinating. First, let me summarize what you just said, because I think it's really important. Roughly: a model is a piece of code plus a whole set of weights, a statistical model that has learned to predict what comes after certain words and phrases. The frontier labs feed it essentially the entire internet, and it's essentially testing itself on predicting the next word across all that data. Technically it's the next token, but the easier way to think about it is the next word in the text. Every time it guesses wrong, it adjusts these things called weights. Is that a fair surface-level description?

Chip HuyenI think of language modeling as a way of encoding the statistical information in a language. For example, we all speak English, and we have a feel for what's statistically more likely. If I say "my favorite color is," you think: okay, a color should come next. "Blue" is much more likely than a word like "like," right? Because statistically, after "my favorite color is," "blue" is more likely to appear. It's essentially encoding statistical information.

So when you train a language model on a huge amount of data, it sees many languages and many domains. It can tell, hmm, this is basically standard English. Then when a user types a prompt, it produces the most likely next token. By the way, this isn't a new idea. It goes way, way back, to papers like "Prediction and Entropy of Printed English" from 1951, by Claude Shannon I believe, a really great paper. One of my favorite stories in it comes from... have you read Sherlock Holmes?

Lenny RachitskyA few of them, yes.

Chip HuyenRight. In the story, Sherlock Holmes uses this kind of statistical information to crack a case. Someone leaves a message written as a series of stick figures. Holmes reasons that since the most common letter in English is E, the most common stick figure must stand for E. Then he works out the rest; I don't remember exactly how. So that is language modeling, except Holmes was doing it at the character level, whereas tokens sit between words and characters, right? A token isn't quite a word, but it's bigger than a character. We use tokens because they help shrink the vocabulary. The character alphabet is smallest, just 26 letters, but there could be millions, tens of millions of words, right? Tokens strike a balance between the two.

For example, say we coin a new word, like "podcasting." Suppose it's a new word, but it splits into "podcast" and "ing." People understand it: we know what "podcast" means, and we know "ing" marks a verb, a gerund. We can even understand the word "podcasting," and that's what tokens are for. So yes, pre-training essentially encodes the statistical information of language so you can predict what's most likely to come next. I think "most likely" is the simplest way to think about it, because it's really building a distribution: okay, 90% of the time the next token might be some color, 10% of the time something else. In other words, it's fundamentally a distribution, and the language model picks from it according to your sampling strategy. Do you always want the most likely token, or do you want it to pick more creatively? I think sampling strategies are very, very important but badly underrated; they can improve performance a great deal.
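The distinction between always picking the most likely token and sampling more creatively can be sketched with a toy next-token distribution. The probabilities below are made up for illustration; real models produce a distribution over tens of thousands of tokens:

```python
import random

# Toy next-token distribution after the prompt "my favorite color is"
# (numbers are invented for illustration):
next_token_probs = {"blue": 0.6, "green": 0.25, "red": 0.1, "like": 0.05}

def greedy(probs: dict) -> str:
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)

def sample_with_temperature(probs: dict, temperature: float = 1.0) -> str:
    """Rescale the distribution, then sample from it.
    Low temperature pushes toward greedy; high temperature is more diverse."""
    weights = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(weights.values())
    r = random.random() * total
    cumulative = 0.0
    for token, w in weights.items():
        cumulative += w
        if r <= cumulative:
            return token
    return token  # fallback for floating-point edge cases

print(greedy(next_token_probs))  # blue
print(sample_with_temperature(next_token_probs, temperature=1.5))
```

Greedy decoding always answers "blue"; temperature sampling will sometimes pick "green" or "red", which is the knob behind "more creative" output.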

Lenny RachitskyGot it, great. So essentially, a model is a piece of code plus this whole set of weights, a statistical model that has learned to predict what comes next after certain words and phrases?

Chip HuyenRight.

Lenny RachitskyAnd post-training, especially fine-tuning, is doing the same thing. Pre-training gives you GPT-5, and fine-tuning is someone taking GPT-5 and doing the same thing again, just on data for a specific use case, nudging those weights so the model fits their very specific scenario better. Is that right?

Chip Huyen对,我觉得权重其实就像函数,对吧。比如你可以把 Lenny 的身高写成 1X 加上某个值,或者 2X 加上某个值,而这个"加上某个值"就是权重,对吧。你不断调整它,直到它能拟合正确的数据,也就是我的身高、你的身高。你既可以把这些参数理解成一组权重,也可以理解成一个函数。你训练和调整这些权重,就是为了让它们去拟合数据,也就是训练数据。
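顺着"权重就是不断调整、直到拟合数据"这个比喻,可以用最简单的一维线性模型做个示意(数据纯属虚构):用梯度下降反复调整权重 w 和 b,让 w*x + b 去逼近训练数据。

```python
# 玩具数据:x 是某个特征,y 是要拟合的目标值(这里真实关系是 y = 2x + 1)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0  # 初始权重
lr = 0.05        # 学习率

for _ in range(2000):
    # 对均方误差求梯度,朝减小误差的方向微调权重
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # 约 2.0 和 1.0
```

大模型做的事在原理上一样,只是权重不是 2 个,而是几十亿到上万亿个,拟合的也不是身高,而是语言的统计规律。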

Lenny Rachitsky太棒了。好,我们聊了预训练、后训练、微调。这里还有没有别的东西也很重要,需要大家知道?关于这些训练部分,大家还应该理解什么?

Chip Huyen其实大多数时候,我们根本不会碰预训练模型。作为用户,我们压根儿不会直接用到它。

Lenny Rachitsky对,它已经替我们做完了。

Chip Huyen是啊。我自己的一个小乐趣是看朋友训练模型时去玩他们的预训练模型,结果他们就会表现得特别离谱,说些像……哎呀,太夸张了。去看后训练到底能把模型行为改到什么程度,其实特别有意思。我觉得现在前沿实验室里,大家花很多精力的地方就是后训练。因为预训练,我觉得一直是用来提升模型的通用能力。它需要海量数据和更大的模型,才能继续提升能力。而到了某个阶段,我们其实已经把互联网数据差不多用到头了,文本数据也接近上限。现在很多人会转向音频、视频这些别的数据,大家都在想新的数据源从哪里来。但在后训练这块,由于大家的预训练数据其实很接近,后训练如今反而成了真正拉开差距的地方。

Lenny Rachitsky这正好可以接到你刚才讲的监督学习和无监督学习。我很喜欢我们聊到这里,真的特别有意思。你说到有标签的数据。基本上,监督学习就是 AI 在已经被人标好、告诉它什么是对、什么是错的数据上学习。比如这是垃圾邮件、那不是垃圾邮件。这个是好短篇小说,这个不是好短篇小说。我们节目里来过不少做这类工作的公司 CEO,像 Mercor、Scale、Handshake、Micro,还有别的几家。所以本质上,这些公司就是在给前沿实验室提供标注好的高质量数据,让它们去训练吗?

Chip Huyen某种程度上是,但我觉得它更像是更大方程里的一个组成部分。里面还有很多别的环节,这也是我刚才提强化学习的原因。我不确定你之前采访过的 CEO 有没有提到这个词。大概思路是,一旦你……假设你有一个模型,给它一个 prompt,它输出一个结果。你会想办法强化、鼓励模型产出更好的结果。那问题就来了:我们怎么知道答案是好是坏?通常人们会依赖信号。最直接的方式之一,就是人类反馈。人类会看到两个回答,然后说:好,这个比那个好。之所以这么做,是因为作为人类,我们很难打出一个特别具体的分数,但做比较会容易得多。

如果你让我给一首歌打分,我又不是音乐人,也不知道这有多难。我可能会想,哎呀,给几分好呢,6 分?然后一个月后你再问我,我可能已经完全忘了,嗯,也许现在是 7 分,或者 4 分,我也不知道。可如果你问我,这两首歌里你更想在生日派对上放哪一首?我就能说:好,我更喜欢这首。所以比较要容易得多。所以你会有一个人类反馈,拿这个人类反馈去训练 reward model,让它判断哪一个更好,然后 reward model 再帮你……好,模型产出了这个回答,reward model 就可以打分,看看这个好不好。然后你会朝着产出更好回答的方向去偏移它。另一种方式是,不用人类,也可以用 AI 来判断回复是好是坏。还有一个现在特别火的东西叫 verifiable rewards,也很自然。比如给它一道数学题,模型会给出一个解法。好,标准答案应该是 42,如果它没给 42,那就是错的,那就不是好回答。是的,很多时候,人们会用这些人类劳动力去产出,怎么说呢,专家问题和期望答案;而在可验证的系统里,模型也可以据此训练。对,就是这样。
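上面说的两类信号可以写成两个极简的函数来示意(纯属示意,不代表任何实验室的真实实现):verifiable reward 直接核对标准答案,给出 0/1 分;人类偏好则只记录"A 比 B 好"这样的成对比较,再拿去训练 reward model。

```python
def verifiable_reward(model_answer: str, gold: str) -> float:
    """可验证奖励:比如数学题,答案对就是 1,错就是 0。"""
    return 1.0 if model_answer.strip() == gold else 0.0

def preference_pairs(feedback):
    """把人类的成对比较整理成 (更好的回答, 更差的回答) 训练样本。"""
    return [(win, lose) for win, lose in feedback]

# 可验证奖励:标准答案是 42
print(verifiable_reward("42", "42"))   # 1.0
print(verifiable_reward(" 41", "42"))  # 0.0

# 人类反馈:只说"哪个更好",不打绝对分
feedback = [("回答A", "回答B"), ("回答C", "回答A")]
pairs = preference_pairs(feedback)
```

两者的差别正如正文所说:前者有客观对错,可以自动打分;后者只有相对偏好,需要先学出一个 reward model 再去给新回答打分。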

Lenny Rachitsky好,我很高兴你讲到了这里。这基本上就是 RLHF,也就是带有人类反馈的强化学习,这正是我也想聊的,对吧?

Chip Huyen对。我觉得更一般地说,它就是一种学习方式,本质上是一种训练方法。至于它是从人类反馈、AI 反馈,还是从可验证奖励里学习,我觉得只是信号来源不同而已。

Lenny Rachitsky太好了。我们在播客里采访过 Anthropic 的 CEO,他聊过他们版本的 RLHF,也就是 AI 驱动的强化学习。我很喜欢你刚才的说法:本质上,你是在帮模型强化正确的行为和正确的答案,而这就是实现它的方法。不管是工程师看到模型输出后说:“不对,我会这样写代码。”然后训练另一个和原模型配合的模型,去判断我是不是正确,大概是这样理解吗?

Chip Huyen对。

Lenny Rachitsky明白。

Chip Huyen我觉得可以这么看。现在这个领域特别让人兴奋,因为有太多领域专家的任务,模型开发者都希望模型能在上面表现得更好,对吧。比如你是会计,也许你想让模型处理会计任务,那就需要很多会计相关的数据样本、很多会计专家的数据。你就得雇很多人,或者说,现在人人都想做物理题、做法律题、做工程题,甚至有人跟我说,他们想用编程来做科学问题,而不只是用编程来做产品,这又是另一个完全不同的世界。而且这些工具也都非常具体。我不知道你平常用什么软件,也许是某个 app、QuickBooks,或者 Google Excel。它们都有很具体的工具和很具体的专业知识。所以你会希望模型也学会这些。

这就需要很多这个领域里的专家,人们要拿这些专家来产出数据,去训练模型。而且这件事规模很大,因为每个人都想要大量数据,也都想要无限预算。但我觉得这里还有一点很有意思,某种程度上是个很有意思的经济学问题。我不知道你有没有跟嘉宾聊过这个,我觉得特别值得想一想,因为这个结构其实很失衡,对吧?因为前沿实验室数量很少,却需要大量数据;而与此同时,提供相关数据的创业公司或企业却非常多。所以你会看到那种做数据标注的公司,它们可能 ARR 增长非常猛,但你如果问它们:“好,那你们有多少客户?”客户数可能非常少,我也说不准。反正我看你刚才笑了。

Lenny Rachitsky对对对,我们聊过这个。

Chip Huyen所以我会有一点……怎么说呢,不太踏实。看到一家公司的增长很疯狂,但它高度依赖两三家公司。与此同时,如果我是前沿实验室,我在经济上最合理的做法是什么?我当然想要很多创业公司,我想要很多供应商,这样我可以挑来挑去,而且这些供应商之间也能互相竞争,压低价格;但无论如何,这种依赖关系都非常强。所以我觉得,这整套经济逻辑特别有意思,我也很想看看最后会怎么发展。

Lenny Rachitsky我听到的是,你对这些数据标注公司的未来是偏谨慎的,因为正如你说的,它们对定价没有太强的议价能力,客户又少,进入这个赛道的人又很多。所以,虽然这些公司是世界上增长最快的一批,你还是觉得前面有挑战。

Chip Huyen我不确定自己是不是看空它们。我只是觉得好奇,因为事情往往会以我没预料到的方式发展。也许这些公司手里有大量数据,它们也许能从中挖出一些洞见,帮它们继续保持领先。所以我也不确定。

章节 04 / 09

第04节


Lenny Rachitsky这个回答很公平。好,既然聊到这里,我们顺着讲一下 evals。这个话题在这档播客里出现得特别频繁。这也是这些公司会提供给 AI labs 的另一类关键数据内容。你能不能先用最简单的方式讲讲,eval 到底是什么?它为什么重要?又是怎么帮助模型变聪明的?

Chip Huyen我觉得大家谈 eval,其实是在解决两类完全不同的问题。第一类是应用开发者:我做了一个 app,比如聊天机器人,我怎么判断它好不好?这是最直接的 eval。第二类是任务导向的 eval 设计。比如我是模型开发者,我想让模型更擅长写代码。问题就变成了:我到底怎么衡量“写代码”这件事?

所以我需要找懂代码的人,去想什么样的代码才算好,然后设计整套数据集和评估标准,来评价代码写作能力。我觉得这里面最有意思的,其实是 eval design,也就是怎么制定标准、怎么写规则、再怎么训练别人按照这个标准去评估。这个过程很有创意。我看过很多别人做出来的 eval,真的会觉得:“哇,这一点都不枯燥,反而特别有意思。”

Lenny Rachitsky我们之前专门做过一期 eval 的播客,嘉宾是 Hamel 和 Shreya。他们说的也正是这个,尤其是给公司做 eval,真的很好玩。那我还想继续追问一下。网上一直有个争论,我也不确定这争论到底有多大,但很多人确实花很多时间想这个问题:AI 产品到底需不需要 eval?有些最好的公司会说他们根本不做 eval,只靠感觉。就是“这玩意儿好不好用?我能不能感受到?”你怎么看 eval 的重要性?尤其是对 AI 应用,不是模型公司。

Chip Huyen我觉得你不需要绝对完美,才有可能赢。你只需要足够好,而且保持稳定就行。这个说法不是我的信条,但我确实跟很多公司合作过,也看到了它是怎么发生的。所以如果问我,为什么有些公司不做 eval,通常是因为他们已经有一个 use case 了,而且它看起来运作得不错,客户也还算满意,只是没有一个特别精确的指标。

然后流量还在涨,大家看起来也挺开心,买单的人也还在买。这个时候工程师就会说:“我们得给它做 eval。” 接着经理就会问:“那做 eval 要花多少人力?” 工程师会说,大概要两个工程师,差不多这么多。然后他们再估算,这件事也许能让效果提升一点,于是经理又会问:“那预期收益有多少?” 工程师可能会说,可能从 80% 提到 82%,或者 85%。
这时候经理就会想:可我拿这两个工程师去做新功能,可能带来的收益大得多。所以很多时候,eval 会被看成“已经够好了,别动它”。如果你在 eval 上投入太多,最后得到的也可能只是一些增量优化;而这些精力放到别的 use case 上,说不定更值,甚至那个 use case 已经好到你只需要 vibe check 一下就行。
所以我觉得大家争论的其实就是这个。很多时候,团队会把产品做到“嗯,够用了”,然后就先上线。可风险也确实存在,因为如果你没有清晰指标,就很难看清应用或模型到底表现如何,最后可能会做出很蠢的事,甚至真的出大乱子。所以如果你的系统规模很大,而失败后果又很严重,那 eval 就特别重要。
这时你就得对摆到用户面前的东西非常严苛,弄清楚不同的 failure mode,哪里可能出问题。如果你的产品本身就是竞争优势,那你还要非常清楚自己现在处在什么位置,也要知道和竞争对手相比如何。但如果它只是一个没那么核心、只是对用户有帮助的东西,那可能就不需要过度执着、过度理论化。可以先觉得:“好,暂时够用了;如果之后出问题,那再说。”
我知道这听起来有点吓人,但归根结底还是一个投资回报的问题。我自己是 eval 的重度粉丝,也很喜欢看 eval。不过我也理解,为什么有些人会选择先不把重心放在 eval 上,而是先去做新功能。

Lenny Rachitsky这个回答真的很务实。我的理解是:eval 很棒,也很重要,尤其当你在做大规模产品时,但你得挑重点,不是每个小功能都要写 eval。Hamel 和 Shreya 说过一个很有意思的点,他们觉得大概只要五到七个 eval,就能覆盖产品最重要的部分。你看到的情况也是这样吗?还是说在生产环境里,大家会做更多?

Chip Huyen我不会把 eval 只看成一个固定数量的问题。eval 的目标是什么?它的目标是指导产品开发。所以我很看重 eval 的原因,是它能帮你发现哪里做得好,哪里还有机会改进。有时候我们会看到非常明显的情况:一看 eval 就发现,某个用户群体的表现特别差。然后我们再往下看,才发现问题其实是我们对这部分用户的表达或信息传达做得不够好。所以我更倾向于关注那些做得不好的地方,因为那里往往有最大的提升空间。

所以 eval 的数量真的要看情况。我们见过一些产品有上百个不同的指标,真的会让人有点疯狂,因为那类产品很通用,维度很多。比如一个 eval 看冗长程度,一个看用户敏感数据,一个看长度。再举个更具体的例子,比如 deep research。你给它一个 prompt,让模型帮你做深度研究。假设你让它只研究 Lenny’s Podcast,帮你出一份报告:Lenny 对什么话题感兴趣?哪些视频最可能爆?他漏掉了哪些应该覆盖的话题?那你该怎么评估这个结果?
我觉得没有一个单一指标能完全搞定。也许有人会做 benchmark,找一百个专家写一堆 prompt,然后看 AI 的答案,这当然可以,但成本极高,也很慢。我后来和朋友聊这个问题时,在想,既然目标是生成总结,那你实际上得先收集信息。为了收集信息,你要做很多搜索查询,然后把搜索结果汇总起来,再发现自己还缺点什么,于是再换一条路线,最后才得到总结。所以每一步都需要评估,不是只看端到端结果。
比如一开始你可能会写五个 search query,那就要看看这些 query 写得好不好,会不会太像了。因为如果五个 query 都差不多,比如都是“Lenny Podcast”“上个月的 Lenny Podcast”“两个月前的 Lenny Podcast”,那就没什么意思。但如果 query 更丰富,比如“podcast”这个词更宽一点,关键词更分散一些,结果就会更有意思。你再看搜索结果,假设你搜 “Lenny Podcast data labeling” 出来 10 页结果,再换成 “Lenny Podcast frontier labs” 又是另一组 10 页结果,那就要看这些页面之间有没有重叠,是不是既有广度也有深度,同时还要看相关性。因为如果 query 跑偏了,结果跟原始 prompt 完全不相关,那就没意义。
所以我觉得每一个环节都需要评估。问题不是“我应该做几个 eval”,而是“为了让我的应用有足够高的覆盖率和置信度,同时知道哪些地方没做好,以便修正,我需要多少个 eval”。
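上面说的"五个 query 会不会太像"这种检查,可以用一个很粗糙的词重叠指标来示意(Jaccard 相似度只是众多可选指标里最简单的一种,真实系统往往会用向量相似度):

```python
def jaccard(a: str, b: str) -> float:
    """两个查询的词集合重叠度:1 表示完全相同,0 表示毫无重叠。"""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def avg_pairwise_similarity(queries):
    """所有 query 两两相似度的平均值,越高说明这批 query 越雷同。"""
    pairs = [(a, b) for i, a in enumerate(queries) for b in queries[i + 1:]]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

too_similar = ["Lenny Podcast", "Lenny Podcast last month", "Lenny Podcast two months ago"]
diverse = ["Lenny Podcast data labeling", "frontier labs training data", "podcast growth topics"]

print(avg_pairwise_similarity(too_similar))  # 明显偏高
print(avg_pairwise_similarity(diverse))      # 明显偏低
```

这正好对应正文的观点:eval 不只看端到端答案,中间每一步(比如搜索 query 的多样性)都可以有自己的指标。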

Lenny Rachitsky明白了。我听出来的另一个重点是,尤其是你的核心 use case,也就是用户在产品里最常走的那条路径,应该把关注点放在这里。

Chip Huyen对,对。

Lenny Rachitsky好。我还想再补一个术语,顺着另一个方向聊。RAG,这个词大家经常看到,R-A-G。它到底是什么意思?

Chip HuyenRAG 是 Retrieval-Augmented Generation,也就是检索增强生成,不是什么特别玄的生成式 AI。它的核心想法很简单:很多问题都需要上下文才能回答。这个概念大概在 2017 年的论文里就出现了。大家后来发现,在一堆问答 benchmark 里,如果给模型补上问题相关的信息,答案质量会好很多。所以他们做的事情就是去 Wikipedia 之类的地方检索信息,把它放进上下文里再让模型回答,效果会好很多。

这听起来好像是废话,对吧,当然应该这样。所以最简单地说,RAG 就是给模型一个相关上下文,让它能把问题答好。真正有意思的地方在于,RAG 最开始主要还是文本场景。

Chip Huyen好的,那我们接下来就会讲很多数据准备的方法,看看怎么让模型更有效地检索。比如说,不是所有东西都像 Wikipedia 页面。Wikipedia 很封闭,主题也很清楚,你基本知道一页里讲的就是一个主题。但很多时候,你拿到的是一堆结构很奇怪的文档。比如假设你有一批关于 Lenny Podcast 的文档,在未来某个时间点,文档开头会写:“从现在开始,podcast 不再指 Lenny’s Podcast 了。” 那如果之后有人问:“讲讲 Lenny,讲讲 Lenny 的工作。” 因为文档里没有直接出现 “Lenny” 这个词,你就可能检索不到它。再比如文档很长,被切成了不同 chunk,第二部分里没有 “Lenny” 这个词,那你同样找不到它。所以你必须想办法处理数据,让信息即便没有显式出现,也能被检索出来,而且确实和查询相关。

章节 05 / 09

第05节


所以大家会想办法做各种处理,比如给每个 chunk 补上上下文信息,或者加一段摘要、元数据,让模型知道它们之间的关系。还有人会用假设性问题。这个思路很有意思:就算只是某个文档 chunk,我也可以先生成一堆这个 chunk 能回答的问题。这样当真正有 query 进来时,我就能看它匹不匹配这些假设问题,从而把它取出来。这个方法很有意思。
在我继续往下讲之前,我想先强调一句:RAG 的数据准备极其重要。在我见过的很多公司里,RAG 方案里提升最大的,往往不是纠结该用哪个向量数据库,而是把数据准备做好。当然,向量数据库也重要,特别是你要考虑 latency,或者你的访问模式是读多写少、写多读少,这些都很关键。但如果只看答案质量,我觉得真正拉开差距的还是数据准备。

Lenny Rachitsky你说的数据准备,能不能举个具体例子?这样我们更容易理解。

Chip Huyen其中一种做法就是处理 chunk。你得先想,每个 chunk 应该多大。因为你要最大化的是上下文利用率。举个很简单的例子,你想检索一千个词。如果一个 chunk 太长,它可能包含更多相关元数据,所以更容易被检索出来;但如果它太长,一个 chunk 可能就已经有一千个词了,那你一检索到它,其他内容就没空间了,反而不太有用。可如果 chunk 太短,你确实能检索到更多相关信息,也能覆盖更多文档,但同时每个 chunk 太小,信息又不够完整。

所以 chunk 的设计要很讲究,得平衡大小。除了 chunk 大小,你还可以加上下文信息,比如摘要、元数据、假设性问题。我听人说过一个很有效的方法:他们把自己的数据改写成问答格式,效果提升很大。比如原来只是把播客内容切块,他们会把它重写成“这里是一个问题、这里是答案”,然后生成很多这样的条目。这个过程也可以用 AI 来做。
这也是一种数据处理方式。我看到的另一个常见例子,是大家用 AI 去帮助特定使用场景和文档。我们现在很多文档,原本都是写给人看的,但 AI 读文档的方式跟人不一样。人类有常识,能自己补很多上下文;即便是人类专家,也会有 AI 没有的背景知识。
有人跟我说过,一个很大的变化是这样:假设你有一个函数,库文档里写着它的输出是某个值,比如某种图上的温度,可能是 1 或 -1。人类专家可能知道这个尺度里 1 代表什么,但 AI 不一定知道。所以他们会给 AI 再加一层注释,比如“这个温度值里的 1 并不是真实温度,它只是对应这个图上的尺度”。把这些数据都处理好,才能让 AI 更容易检索到相关信息,并正确回答问题。
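上面讲的 chunk 大小权衡,可以用一个最朴素的按词切块函数来示意(真实系统通常还会按语义边界切、并附上摘要或元数据,这里只演示大小和重叠这两个参数):

```python
def chunk_words(text: str, chunk_size: int, overlap: int):
    """按词数切块,相邻 chunk 之间保留 overlap 个词的重叠,减少信息被切断的情况。"""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(10))
print(chunk_words(doc, chunk_size=4, overlap=1))
```

chunk_size 调大,单个 chunk 信息更全但占满上下文;调小则能检索到更多文档但每块信息不完整,正是正文描述的取舍。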

Lenny Rachitsky本期节目由 Persona 赞助。Persona 是一个身份验证平台,帮助企业完成用户注册、打击欺诈、建立信任。我们在这档播客里经常聊 AI 带来的惊人进步,但它也是一把双刃剑。每一个让人惊叹的时刻,背后都可能有骗子在用同样的技术搞破坏,比如洗钱、盗用员工身份,或者冒充企业。Persona 通过自动化的用户、企业和员工验证来应对这些威胁。不管你是想识别候选人欺诈、满足年龄限制,还是保护平台安全,Persona 都能按你的具体需求来做验证。更重要的是,它不会给正常用户增加太多摩擦,却能让你清楚知道自己在跟谁打交道。这也是 Etsy、LinkedIn、Square 和 Lyft 这样的领先平台信赖 Persona 的原因。Persona 还给我的听众提供一年共 500 次免费服务。直接去 withpersona.com/lenny 开始就行。再次感谢 Persona 赞助本期节目。

好,接下来我们来聊聊你跟这些公司合作的内容。你会帮他们做 AI 战略、AI 产品、搭什么工具、怎么构建这些东西。我想在这里多聊一点,因为现在很多公司都在做 AI 产品,只是很多公司做得并不顺。先问你几个相关的问题,看看你在真正做得好的公司里学到了什么。
第一个问题,是关于公司内部对 AI 工具的采用,或者更广义地说,对工具的采用。最近大家都在聊 AI 热潮,数据其实已经显示,大多数公司试过一下,效果一般,然后就停了。所以很多人开始怀疑:这东西是不是没戏了?从公司内部采纳 AI 工具这件事来看,你现在看到的是什么?

Chip Huyen就公司里的 GenAI 工具来说,我看到大概分成两类。第一类是内部生产力工具,比如 coding 工具、Slack 机器人、内部知识库。很多大企业都会在模型外面套一层 wrapper,接上某种 RAG 方案。我们前面聊的是基于文本的数据和 RAG,但还没聊 agentic RAG,也还没聊多模态 RAG。不过这确实是一个很令人兴奋的领域。它本质上是让员工能访问内部文档。比如有人问:“我准备生孩子,公司有没有产假或陪产假政策?健康保险能覆盖什么?我想安排面试、想推荐朋友,流程是什么?” 这类内部 chatbot 就是用来帮忙处理内部流程的。

第二类则是更面向客户或合作伙伴的工具。比如产品客服 chatbot 就很常见。如果你是连锁酒店,你可能会有 booking chatbot,这类东西规模非常大。booking chatbot 之所以很多,我猜和一个原因有关:很多应用之所以推进不下去,是因为结果很难被精确衡量。而 booking chatbot 或 sales chatbot 就很适合,因为它们的结果非常清楚。你可以直接比较:现在有人类客服时的转化率是多少,用 chatbot 之后又是多少。某种程度上,这类方案的结果很明确,所以公司也更容易接受。
这就是另一类工具。我觉得对于客户侧、也就是对外的工具,人们更愿意选择结果很清楚的应用。要不要采用它,关键就在于能不能看到结果。当然,这也不是完美的,因为有时候结果不好,不一定是想法不行,也可能只是实现过程做得不够好。对,所以这很棘手。
而内部工具、内部生产力这块就更复杂了。我觉得很多公司在谈 AI 战略时,通常会看两个关键方面:第一是 use case,第二是人才。你可能有很好的数据和很好的 use case,但如果没有人才,你还是做不出来。
所以在 GenAI 刚开始普及时,我很佩服很多公司的做法。它们会说:"好,我们得让员工对 GenAI 有足够的认知,对 AI 够熟悉。" 于是它们会先给团队引入一批工具,还会开很多 up-skilling workshop,鼓励大家学习,这其实非常好。很多公司也愿意花不少钱去推进这件事,给大家买订阅、买各种工具,想提升员工的 AI literacy。
但问题也在这里。很多公司会说:“我们花了很多钱在这些工具上,但怎么感觉大家没怎么用?” 你能看到使用量,但好像大家并没有真的把它们用起来,所以问题到底出在哪儿?这就很难说了。

Lenny Rachitsky你觉得问题在哪?是大家不会用吗?你觉得差距主要在哪里?你觉得未来会不会真的出现一种状态,让我们都觉得“哇,AI 已经让工作完全不一样了”?

Chip Huyen最核心的问题还是,生产力真的很难衡量。我经常跟很多人聊这件事。先说 coding,很多公司都在用 coding agent 或 coding 工具。我会问他们:“你觉得它有提升你的生产力吗?” 很多时候,回答都很模糊,比如“感觉好像更好了”。然后他们会说,因为 PR 更多了、代码更多了、提交更快了之类的。但问题是,代码量、上线代码量,这些都不是衡量生产力的好指标。所以这事真的很难。

我通常还会让大家去问他们的经理,因为我一般接触的是 VP 层,他们下面有很多团队。我就会问:“如果让你选,你更愿意给团队里的每个人配非常贵的 coding agent 订阅,还是再加一个人头?” 很多经理第一反应都会选人头。但如果你问 VP,或者那种管很多团队的人,他们就更可能说:我要 AI 助手。
这是因为作为经理,你还在成长阶段。你还没到那种要管成百上千人的位置,所以对你来说,多一个人头很重要。你想要它,未必是为了生产力,而是因为你就是想多几个人替你干活。但对高管来说,你更在意业务指标,所以你会去想,到底什么东西才真的在驱动你的生产力指标。
所以这件事挺棘手的。我不确定问题本质上是不是“AI 会不会让人更高效”,而是我们没有一个好办法去衡量生产力提升。

章节 06 / 09

第06节


很多时候,我们在阅读时会被一些很小的细节绊住。所以我想做创意写作,不只是因为我想成为更好的作者,也因为它逼着我去面对不同的受众,让我更会预判这种不同类型的受众会想听什么、在意什么。对我来说,不管是写作还是任何形式的内容创作,本质上都是在预测用户的反应,对吧?

Lenny Rachitsky下一个 token。

Chip Huyen你做播客啊。

Lenny Rachitsky开玩笑的。

Chip Huyen对。做播客也是一样,就是你得想,用户会觉得什么有吸引力,对吧?很多公司发布产品时也会先讲一个叙事,然后再想怎么把这款产品定位成用户会想要的样子。所以我做技术写作很多年了,多少能猜到工程师想听什么、在意什么。但这次我面对的是完全不同的一类受众,我以前没有这方面的经验。所以我才想通过创意写作、写故事来训练这一点。

也正因为如此,我做了很多研究。我其实很享受这个过程,看了很多剧,就是想看看大家到底喜欢什么。我特别在意的一点,是我从一位编辑那里学到了什么叫“情绪旅程”。写东西的时候,我们会关心读者在整个故事里的感受。开头得有钩子,让人继续读下去;但也不能戏太满,因为人会看累,会情绪疲惫,因为很多时候你其实是在被情绪牵着走。所以一个好的故事要有情绪起伏,可能有高潮,也可能有比较轻松的段落。
还有一点是我之前没意识到的。对我来说,技术写作几乎完全聚焦在内容和论点上,非常客观、也很不个人化。比如 ML compiler 这种东西,读者在乎的是原理对不对,并不在乎讲的人是谁,因为它本身就是客观的。但小说不一样,读者会在乎角色是不是讨喜。
所以我故事的第一版里,人物都写得特别理性、特别讲逻辑,什么都按逻辑来。结果朋友看完后跟我说,他是个很好的人、特别棒的人,但他老实说,讨厌这个角色。也就是说,作为故事来说,这个人物太不讨喜了,所以他根本不想继续看。于是我写了第二版,把角色写得更可爱一点。让角色更讨喜的方法之一,就是加入一点脆弱感。有时候让她/他碰壁一下,因为我们有时能从这些地方产生共鸣。总的来说,这很有意思。很多东西都在于理解情绪层面,不只是故事本身,还有角色本身的感受。

Lenny Rachitsky这太有意思了,哇。我学到的东西比我原本想的还多。这个例子真的很棒。Chip,最后两个问题。大家如果想在线上找到你、联系你,或者想跟你合作,应该去哪儿?还有,听众怎样对你最有帮助?

Chip Huyen我在 LinkedIn 和 Twitter 上都有账号,不过我平时发得不多,但我一直在提醒自己应该多发一点,因为我其实很喜欢和读者交流。最近我也准备开始写 Substack 了。现在我已经先留了一个 Substack 占位页,我打算主要写系统思维,因为我觉得这是一项很有意思的能力。我还在考虑做一个 YouTube 频道,专门做书评,或者说那些能帮你把思考变得更好的书。

我想我会先评的第一本书,可能就是这本,因为它是我从小最喜欢的书,我一直在反复读。所以如果你们能帮忙的话,欢迎给我推荐你喜欢的书,尤其是那些真正改变了你思考方式、或者改变了你做事方式的书。我会很感激。

Lenny Rachitsky太好了,我已经很期待去读那本书了。

Chip Huyen嗯哼。

Lenny RachitskyChip,非常感谢你今天来。

Chip Huyen谢谢你邀请我,Lenny。

Lenny Rachitsky大家再见。非常感谢收听。如果你觉得这期有帮助,欢迎在 Apple Podcasts、Spotify 或你常用的播客 App 上订阅这个节目。也请考虑给我们打个分,或者留个评论,这真的能帮助更多听众找到这档播客。你也可以在 lennyspodcast.com 找到过往所有节目,或者了解更多关于这档节目的信息。我们下一期再见。

Lenny Rachitsky我还想帮大家再搞清楚一件事。你写了一本叫《AI Engineering》的书,本质上是在帮助大家理解一种新的工程师类型。你还特别简单地讲过 ML engineer 和 AI engineer 的区别,这个区分对现在的产品经理也很有对应关系,就是 AI product manager 和非 AI product manager。按照你的说法,我来复述一下我的理解:ML engineer 是自己训练模型的人,AI engineer 则是用现成模型来做产品的人。你还想补充什么吗?

Chip Huyen我其实很不喜欢写书的一点,就是它非得把概念定义清楚,但我觉得没有任何定义会是完美的,因为总会有边缘案例。不过大体上,我觉得这更像是 GenAI as a service,也就是“模型即服务”,别人已经帮你把模型训练好了,而且基础模型的能力已经相当强了。这样一来,很多人就会想:好,那我现在要把 AI 集成到产品里,我不一定需要先去学那些底层训练细节,虽然懂这些当然会有帮助。但整体上,它确实把想用 AI 做产品的门槛拉得非常低;与此同时,AI 的能力又很强,也把 AI 能被用来做什么的范围一下子拓宽了。所以我觉得,一方面进入门槛变低了,另一方面对 AI 应用的需求又变大了。整个局面特别让人兴奋,像是一下子打开了一个全新的可能性世界。

Lenny Rachitsky对。以前你得花时间去造这颗 AI 大脑,现在你不用了,直接拿来干活就行,这个解锁太大了。好,最后一个问题。你看得到很多什么东西有效、什么东西无效、未来会往哪走。我想问的是,如果把时间往后看两三年,你觉得做产品会有什么不一样?公司怎么运作会有什么不一样?如果只能挑一个你觉得未来几年最重要的变化,你会怎么说?

Chip Huyen我觉得很多组织其实并没有那么慢,但同时它们又比我预期的更快一点,因为我大概本来就不会去接触那些完全不在乎这事、像恐龙一样的公司。来找我的很多高管本身就很前瞻。所以对我来说,我接触到的组织本来就有点偏向“跑得快”的那一类。

所以我觉得,一个很大的变化会发生在组织结构上。我觉得这里面有很多价值。以前我们有很多彼此割裂的团队,工程团队和产品团队分得很清楚,但问题来了:eval 谁来写?指标谁来负责?结果你会发现,eval 根本不是一个独立问题,它是一个系统问题,因为你得看不同组件是怎么互相作用的;你还得看用户行为,因为你得知道用户真正关心什么,这样你写出来的 eval 才能反映用户真正关心的东西。
所以这些东西都得从不同组件的架构里去看,再加上 guardrail 之类的东西。工程负责的是把这些都搭起来,但理解用户,这就是产品的事。也正因为如此,eval 变得极其重要。它会把产品团队、工程团队,甚至像用户增长这样的营销团队,拉得更近。换句话说,组织结构会让原本非常分开的职能之间有更多沟通。
另一件事是,我也在看各个团队接下来几年里哪些工作能被自动化,哪些不能。我已经看到有些团队在裁掉一些以前外包出去的职能。老实说,这想起来有点吓人,但他们跟我说得倒是很直接:这挺好的,这对你我都好,不过像以前那些外包出去、而且不是核心的事情,我们已经不需要那么多人了。
传统上,这些事情本来就是业务外包,既不是核心,也可以被更标准化地处理。那这样一来,AI 就能把其中很大一部分自动化掉。所以大家也开始重新思考 junior engineer 和 senior engineer 的价值是什么、工程组织该怎么重构。对,我确实觉得,这会是成功组织里的一个重要变化。人们会不断调整棋子,思考新的 use case 要不要拆出来、该由谁来带一个新项目,这会是很大的变化。
还有一点是关于 AI 本身。我不确定这到底有多真,但我自己也在某种程度上认同这种判断:基础模型大概率还没到顶,但我们可能不会再看到那种特别夸张、飞跃式变强的模型了。
你应该还记得 GPT、GPT-2 那种跃升吧?然后是 GPT-3,再到 GPT-4、GPT-5。每一代都在进步,但如果要说有没有像最早那几次那样“巨大跃迁”,这件事就见仁见智了。所以我觉得,接下来基础模型性能的提升,大概不会像过去三年那样让人眼前一亮。真正会有很多进步的地方,我觉得是在后训练阶段、在应用构建阶段。对,我非常看好这部分。
我也特别关注多模态。我们现在已经看到了很多文本场景,但我觉得音频、视频这些用例会非常非常有意思。
而且音频这块其实还没那么成熟。我跟几家语音初创公司合作过,所以我真的觉得,voice 是完全不同的一个怪兽。比如说,如果你把聊天机器人从文字版变成语音版,整个系统的构成就完全不一样了。因为现在你得考虑延迟:先语音转文字,再文字对文字,再把文字问题变成文字答案,最后还要文字转语音。中间有很多 hop,所以 latency 变得非常重要。还有一个问题是,它怎么才会听起来像自然对话?
比如在人和人聊天的时候,如果我说话时你想打断我,说一句“Chip......”,我会停下来,听你把话说完。但有时候,即使你只是发出一个很短的确认音,比如“嗯嗯”,我也不该停,应该继续讲下去。所以,“什么时候该打断、什么时候不该打断”这个问题,对自然对话的感受影响非常大。这里面还有监管问题,因为很多人想做出那种听起来像真人的 AI 语音助手,故意让用户以为自己在和真人说话,但监管层面也可能会要求你明确告诉用户:现在跟你说话的是 AI 还是人。所以我觉得,这整个领域并没有大家想象得那么简单。
不过这又不完全是 AI 基础模型的问题,因为“人是否在插话”这件事,其实是个经典机器学习问题。
换个说法也行,你可以给它一个分类器来判断。至于 latency,那其实是个巨大的工程挑战,不是 AI 挑战。当然,它也可以变成 AI 挑战,因为大家现在在做 voice-to-voice 模型。也就是说,你不必先把我说的话转成文字,再让模型生成文字答案,再让另一个模型把文字转成语音;你可以直接从语音到语音。这就是我们现在在做的事,但真的很难。对,所以连音频我都觉得还没完全解决。某种意义上,它甚至比视频更简单一点,因为视频同时有图像和声音,本来就更难。所以这个方向还有很多挑战。
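正如这里说的,"什么时候该打断、什么时候不该打断"可以当作一个分类问题。下面是一个纯规则的玩具版本(真实系统会用声学特征加模型判断,这里的附和词词表完全是假设的):

```python
# 假设的"附和词"词表:听到这些不应该停下来
BACKCHANNELS = {"嗯", "嗯嗯", "对", "uh-huh", "yeah", "mm"}

def should_stop_speaking(utterance: str) -> bool:
    """极简判断:短的附和音继续讲,其他较长的插话则停下来让用户说。"""
    text = utterance.strip().lower()
    if not text:
        return False          # 没有人声,继续讲
    if text in BACKCHANNELS:
        return False          # 只是确认音,继续讲
    return len(text) > 2      # 更长的话当作真正的打断

print(should_stop_speaking("嗯嗯"))        # False:继续讲
print(should_stop_speaking("等一下,Chip"))  # True:停下来听
```

真实的语音助手还要在几百毫秒内做出这个判断,这也是正文强调 latency 是巨大工程挑战的原因之一。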

Lenny Rachitsky这份清单太棒了。我快速帮你复述一下。你预测的接下来几年里会改变我们工作方式的东西,也和我在这档播客里听过的很多对话很呼应。简单说,就是你在继续强调未来会往哪走。

第一,是不同职能之间的边界会变得越来越模糊,不会再只是设计和工程,大家都会做很多不同的事情。
第二,是更多工作会被 agent 和各种 AI 工具自动化掉,理论上生产力会继续上升。
第三,是重心会从预训练模型转向后训练、微调这类事情,因为照你说的,模型变聪明的速度可能会放慢。
不过我也想顺手补一句:我前阵子和 Anthropic 的联合创始人聊过,他提了个很好的点。他说,我们在身处指数增长中间时,其实很难真正理解指数增长是什么感觉。再加上模型发布的频率越来越高,所以两代模型之间的差异,我们可能没那么容易察觉,因为它们发布得太频繁了。跟 GPT-2 到 GPT-3 那种间隔更长的时期相比,也许这是有差别的,也许没有。你的第四点,就是多模态,以及对多模态体验的投入。我真的等不及 ChatGPT 语音模式把"插话"这件事做得更好,完全就是你刚才说的那个问题。我正跟它说着话呢,旁边有人轻轻哼一声,它就突然停下来,烦死了。

Chip Huyen我也很震惊,为什么我们家里到现在还没有更好用的语音助手。我真的试过一堆,老实说。我总会想,天哪,这次说不定就是那个答案了,然后又发现它们差得不行,最后只能全送人。

Lenny Rachitsky我觉得它会来的。我听说快来了。Anthropic 也在跟某家公司合作,我不确定是不是已经上线了。

Chip Huyen对。顺着你刚才提到 Anthropic 那位嘉宾说的“性能提升”这件事,我想再拉回来讲一下。我觉得这里有个很大的变化,就是模型本身的能力,和我们在使用时感受到的性能,其实是两回事。也就是预训练模型本身,和模型在实际推理时表现出来的能力。你听过 `test time compute` 这个词吗?

Lenny Rachitsky我不太了解,给我们讲讲。

Chip Huyen这个想法大概是这样:你手里的算力是固定的。你会把大量算力花在预训练,也就是训练模型本身上;然后还会花一部分在后训练、微调上。不同实验室在预训练和后训练上的算力分配比例差异特别大。除此之外,模型训练好了,真正上线给用户用的时候,还要花算力做推理。也就是说,当我训练出一个模型并想把它服务给用户时,用户在 prompt 里问一句话,模型要生成回答,这一步也是要消耗算力的。

所以大家会争论:我到底该把更多算力花在预训练、微调,还是推理上?而推理这部分,我觉得有人就把它叫作 `test time compute`。也就是说,在测试/推理阶段多花算力,是一种策略:你把更多计算资源放到推理时的生成上,希望得到更好的表现。它为什么有效呢?
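test time compute 最容易演示的一种形式是 best-of-n:推理时多生成几个候选答案,再用某种打分挑最好的。下面是一个玩具 sketch(generate 和 score 都是假想的占位实现,实际中前者是真模型、后者可能是 reward model 或可验证检查):

```python
import random

def generate(prompt: str, rng: random.Random) -> str:
    """假想的模型:随机给出一个质量不一的候选答案。"""
    return rng.choice(["很差的回答", "一般的回答", "很好的回答"])

def score(answer: str) -> int:
    """假想的打分器:给每个候选一个质量分。"""
    return {"很差的回答": 0, "一般的回答": 1, "很好的回答": 2}[answer]

def best_of_n(prompt: str, n: int, seed: int = 0) -> str:
    """多花 n 倍推理算力:生成 n 个候选,取打分最高的。"""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("解释 test time compute", n=8))
```

直觉很简单:同样的模型,推理时抽样越多次,碰到好答案的机会就越大,也就是把算力从训练侧挪到了推理侧。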

Chip Huyen你会更愿意给团队里每个人都配一个很贵的 coding agent 订阅,还是多一个人头?大多数经理大概都会选后者。但如果你问的是 VP,或者那种管很多团队的人,他们往往会说,还是 AI 助手更划算。因为作为经理,你还在成长阶段,离那种要管成百上千人的层级还远。对你来说,多一个人头是件大事,所以你想要它,未必是为了生产力,而是因为你就是想多几个人替你干活;但对高管来说,你更看重业务指标,所以你会去想,到底什么才真的推动了你的生产力指标。

所以这问题挺棘手的。我不确定问题本质上是不是“用了 AI 会不会更高效”,更像是我们还没有一个好办法去衡量生产力提升。
还有一个现象也很有意思。我听不少人说,他们发现不同类型的员工,对 AI 助手的反应真的不一样。我老是拿 coding 来举例,因为 coding 很重要,而且也更容易分析。比如我听过一个例子:有人跟我说,在他们团队里,资深工程师最可能拿出最高产出,也最可能从这类工具里受益。那个人很有意思,他把团队分成了三档:当前表现最好、平均、和最弱。当然他不会把这件事告诉大家。然后他们做了一个随机试验,每一档里一半的人拿到 Cursor。结果他们观察一段时间后,发现了一个挺有意思的现象。
在他看来,提升最大的反而是资深工程师,也就是表现最好的人。第二高的是中间那组。他的理解是,这也合理,因为表现好的工程师本来就更会解决问题,所以他们更知道怎么把 AI 用在刀刃上;而表现最差的人,要么本来就不太在意工作,更容易直接开自动驾驶模式,让 AI 先生成代码再说,要么就是本来就不知道该怎么做。
不过另一家公司告诉我的情况却正好相反:他们说,资深工程师反而是最抗拒用 AI 工具的人,因为他们更有主见、标准也更高,会觉得“AI 写出来的代码太烂了”,所以非常排斥。我到现在也还没法把这些完全不一样的说法统一起来。

Lenny Rachitsky这太有意思了。让我确认一下我是不是听对了:有一家公司做了一个三档测试,把工程团队分成最高绩效、中等绩效和最低绩效三组,然后给其中一部分人、也就是每一组里的一半人,开通了 Cursor。是 Cursor 吧?

Chip Huyen我记得应该是 Cursor。

Lenny Rachitsky好。那在同一家公司里,是不是给高绩效工程师的一半配了 Cursor,另一半没有?他们到底是怎么切分的?

Chip Huyen对,他们是给全公司一半的人开通了,但每一档里都各拿出一半。对。

Lenny Rachitsky哇。

Chip Huyen然后他们去观察生产力差异。

Lenny Rachitsky明白了。那他们到底是怎么做的?就是“好,你拿 Cursor,你不用 Cursor”这样吗?这也太有意思了。

Chip Huyen具体怎么操作我没有深挖,但我当时心里想的是,能做随机试验已经很值得尊敬了。

Lenny Rachitsky这太酷了。

Chip Huyen是啊。

Lenny Rachitsky这支工程团队多大?是几百人吗?

Chip Huyen没那么大,大概 30 到 40 人左右。

Lenny Rachitsky30 到 40 人。好。

Chip Huyen对。

Lenny Rachitsky哇。也就是说,他们发现最高绩效工程师从 AI 工具里获益最大,然后是中间那档,最后才是最弱那档。明白了。

Chip Huyen不过这也不是到处都一样。

Lenny Rachitsky对对对,对。

Chip Huyen有些公司就不一样。

Lenny Rachitsky对。你刚才举的另一个例子里,资深工程师在这个场景下最抗拒改变工作方式,我能理解。我确实觉得,现在最有价值的人,除了像你这样的 ML 研究员和 AI 研究员,大概就是资深工程师了。因为感觉 junior 工程师能做的很多事情,现在都已经被 AI 接管了;但一个真正懂自己在做什么、懂大规模系统怎么运作、还能用好 AI 工具的人,就像手里有一群无限多的 junior 工程师听他差遣,感觉是极其有价值、极其强大的资产。

Chip Huyen对,我确实很欣赏这一点。你会看到,有些公司特别看重那些对整个系统有很好理解、能做出很强问题拆解和系统思考的人,而不是只盯着局部。我们还听过一家公司跟我说,他们现在的工作方式已经完全不一样了。他们把工程组织重新调整了一下,让更多资深工程师去做同侪评审,因为他们会去写什么才算好的工程实践、流程应该是什么样。

或者说,他们会写很多关于“怎么把事情做好”的流程。然后更多 junior 工程师负责产出代码、提 PR,而资深工程师更多是在审查这一端。所以我觉得,这可能是在为未来做准备。还有一家公司也跟我讲了非常类似的事:他们在为未来做准备,未来可能只需要一小群、非常非常强的工程师去制定流程、审代码、把关进生产环境,而让 AI 或 junior 工程师去产出代码。可问题又来了:那一个人到底怎么成长为足够强的资深工程师?

Lenny Rachitsky对,这才是问题。没错。

Chip Huyen对。所以我也不知道这个过程到底该怎么设计。

Lenny Rachitsky没人真的在想这个。这就是个问题。10 年、20 年后,我们可能就没有工程师了,因为根本没人再招 junior 工程师了。不过我也可以换个角度说。现在刚进计算机科学的人,本身就是 AI native。理论上,如果他们足够好奇,不只是把思考和学习外包给 AI,而是真的用 AI 学会如何写好代码、如何正确做架构,那他们反而可能会成长得非常快,最后成为未来最成功的工程师。

Chip Huyen我觉得你刚才说的 architecture,其实我会把它归到 system thinking 里。我确实觉得这是很重要的能力,因为 AI 可以帮我们自动化很多彼此割裂的技能,但知道怎么把这些技能组合起来解决问题,这件事很难。所以前阵子有一场我很喜欢的 webinar,嘉宾是 Mehran Sahami,斯坦福 CS 系的课程负责人,花了很多时间思考现在 AI 编程时代学生该学什么;另一位嘉宾是 Andrew Ng,当然是 AI 领域的传奇人物。Mehran Sahami 说过一句很有意思的话:很多人以为 CS 是关于 coding 的,但其实不是,coding 只是手段,不是目的。

CS 的核心是系统思维,是用 coding 去解决真实问题。只要问题还在,problem solving 就不会消失,因为 AI 会把更多东西自动化,但问题只会变得更大。而理解问题是怎么产生的、如何一步一步设计解决方案,这件事永远都在。所以我举个例子,我其实对 AI 在 debug 这件事上有不少意见。我不知道你平时会不会大量用 AI 写代码,但我自己和朋友们的体验都是:当任务非常清楚、非常定义明确时,它表现得不错,比如写文档、修一个具体功能,或者从零搭一个 app,尤其是不需要跟大规模代码库深度交互的时候。可一旦你把事情稍微搞复杂一点,比如需要和别的组件联动,它通常就没那么好用了。
比如我前阵子用 AI 去部署一个应用,当时我在试一个我不熟的新托管服务。平时它会先提醒我,所以 AI 确实让我更有信心去试新工具。以前我对尝试新工具会有点抗拒,因为一开始连文档都不一定看得懂。但那次我就想,行,试试吧,学一下。我就在测这个新托管服务,结果一直报 bug,特别烦。于是我就让 AI 去修,它就不停改来改去,可能改环境变量、改代码、把这个函数换成那个函数、改语言,甚至猜是不是它不支持 JavaScript,反正各种尝试都做了,但就是不行。最后我就想,算了。
我自己去看文档,看看到底哪里错了。结果发现,问题根本不是代码,而是我当时用的那个套餐/层级里,这个功能本来就不可用。也就是说,AI 一直在朝错误的方向修,试图从别的组件上解决问题,但真正的问题其实出在不同的组件层面。所以这件事让我意识到,理解不同组件是怎么协同工作的、问题源头可能从哪里来,你得有一个整体视角。它也让我开始想:我们怎么教 AI 学会 system thinking?很多人类专家在处理这类问题时,其实都有一套很清晰的 scaffolding,比如“先看这个,再看那个,再看那个”。AI 也许可以靠这样的框架学会系统思维,但这也让我反过来想:我们该怎么教人类学会系统思维?所以我觉得这是一项非常有意思、也非常重要的能力。

Lenny Rachitsky这和 Bret Taylor 在播客里分享的观点几乎一模一样。他是 Sierra 的联合创始人,也做过 Google Maps,当过 Salesforce、Quip 等公司的 CEO。我问过他,大家到底要不要学编程,他的观点和你说的一样:学计算机科学不是为了学 Java 和 Python,而是学系统怎么运作、代码怎么工作、软件整体怎么运作,而不是“这里有个函数,拿去做个事”。

Chapter 08 / 09

Lenny Rachitsky: Amazing.

Chip Huyen: You see what I mean?

Lenny Rachitsky: Yes, totally.

Chip Huyen: Right?

Lenny Rachitsky: That echoes Ben Mann's point perfectly.

Chip Huyen: Right.

Lenny Rachitsky: Chip, we've covered so much today. I've already learned everything I came for and more. Before we get to the very exciting lightning round, is there anything else you'd like to share, or anything you want to leave listeners with?

Chip Huyen: I work with some companies that want employees to come up with ideas themselves. So there's a big debate right now: should AI strategy be top-down or bottom-up? Should executives pick one or two killer use cases and pour all resources into them, or should engineers, PMs, and smart people generate ideas themselves? I think you need both.

Some companies say, fine, we've hired a bunch of smart people, let's see what they come up with, and then run more hackathons or internal challenges to encourage people to build things. But I've noticed something that shocked me: a lot of people simply don't know what to build. I sometimes feel we're in a kind of "idea crisis", right?
We now have all these cool tools to build things from scratch: design, code, websites. In theory we should be seeing far more output, but in reality many people are stuck and don't know what to build. I think this may have to do with social expectations, because we've entered a stage of deep division of labor and extreme specialization. People are asked to do one thing exceptionally well, but fewer and fewer have the big picture. And without the big picture, it's hard to come up with ideas about what to build in the first place.
So when I run hackathons with a company, we put together a guide for generating ideas. One very practical method we usually suggest: look back at the past week. Over a full week, notice what you did and what annoyed you most. Every time something annoys you, ask: is there another way to do this? Can it be made less annoying? If many people on the team share the same annoyance, it's probably worth building something for. So observe how you work, keep iterating, keep asking yourself how things could be better, and then build something around those annoyances. That's actually a great way to learn and adopt AI.

Chip Huyen: I'd love to see that. I'm very bullish on using AI for these micro-tools, really, just building tiny tools that make life a little easier.

Lenny Rachitsky: A hundred percent. I think that's also one of the main ways people use these tools: solving a small but very specific problem of their own. OK, Chip, on to the very exciting lightning round. I have five questions for you. Ready?

Chip Huyen: Sure, anytime. Though it depends on how hard the questions are.

Lenny Rachitsky: The questions are pretty consistent across guests, so I'd guess you've heard them before. First question: what are the two or three books you recommend to people most often?

Chip Huyen: I'm actually wary of recommending books, because what someone should read really depends on what they want, what stage of life they're in, and where they want to go. That said, a few books did change how I see the world and how I think. The first is The Selfish Gene. It helped me think through whether I wanted kids, because it helped me understand that so much of our behavior, so much of how we operate, comes down to genes, and genes want to do only one thing: replicate.

In a way, the book also makes another point: everyone wants to live forever. Maybe not consciously, but subconsciously we all do. And there are two ways to achieve that. One is through your genes, passing them on. The other is through ideas: if you leave ideas in the world that persist for a long time, part of you lives on. I know it's a bit abstract, but I find it fascinating.
I also love another book, by Singapore's former leader, Lee Kuan Yew, who I believe is called the founding father of Singapore. I'm not sure I have the title exactly right, but he's the one who took Singapore from a third-world country to a first-world country within 25 years. I've rarely seen a national leader write down his thinking on how to build a country so seriously.
The book covers a lot of public policy: how to design policies that encourage people to do things that benefit the country, plus diplomacy, foreign policy, national liberation, that sort of thing. So it's very thought-provoking. For me it's also a form of systems thinking, just applied to a different kind of system, a country, which most of us never get a chance to actually experiment with. So learning about it is worthwhile.

Lenny Rachitsky: What was the second book called again?

Chip Huyen: It's called From Third World to First. Actually, I think I have it right here.

Lenny Rachitsky: There it is.

Chip Huyen: This book is really thick.

Lenny Rachitsky: Showing it off live.

Chip Huyen: Yep.

Lenny Rachitsky: Amazing. I definitely want to read that one. Great recommendation. I've heard a lot about his impact, and I've seen many of his insights on Twitter about how to build a thriving society. Clearly, the approach worked.

Chip Huyen: Right? Can you believe he found the time to write such a thick book? It's wild.

Lenny Rachitsky: Yeah. "Claude, please summarize this for me." Just kidding. By the way, I also love The Selfish Gene. Great pick, one of those quietly influential books that genuinely changed how I see the world. Excellent choices. OK, next question. Do you have a favorite recent movie or TV show?

Chip Huyen: I've been watching a lot of movies and shows, partly as research, because I'm writing my first novel, and it recently sold. So I really want to study what moves people. It's a drama, not sci-fi, not what people in tech usually watch. I know that's out of left field and unexpected, so when I watch TV and shows it's more like observing which kinds of stories resonate, understanding the tropes and so on. So I'm not sure listeners will like it...

Lenny Rachitsky: Can you name one? Is there a show that taught you something about writing?

Chip Huyen: I'd say Story of Yanxi Palace (延禧攻略) is one. It's a Chinese TV series.

Lenny Rachitsky: Cool, I haven't heard anyone mention that one on this podcast before.

Chip Huyen: Right.

Lenny Rachitsky: Next question. Do you have a life motto you come back to when work or life gets hard, something you use to remind yourself?

Chip Huyen: This will sound nihilistic. I'd say: in the end, nothing really matters. Usually I think about the long view. Ten billion years from now, nothing will be left and no one will be around. I know some people will argue with me on this. My theory is simply: a billion years from now, none of us will exist. So however messy, crazy, or badly we do things now, no one will remember anyway. In a sense that sounds scary, but it's also liberating, because it lets me say: fine, let's just try it, why not?

A relative of mine passed away recently, and I couldn't make it back, so I talked with my dad about it. I asked him, "OK, is there anything I can do for him..." basically to offer comfort, or send something. And my dad said, "At this point, what could he possibly want?" That made me feel even more strongly that at the end of life, material things bring no joy. No money, no products, nothing. So then I ask myself: what do I actually care about?
So I think: maybe I failed, maybe I didn't get that contract, maybe all sorts of things happen, but at the end of life I don't think any of it really matters. In a way, that's liberating.

Lenny Rachitsky: I know you said it might sound nihilistic. Steve Jobs said something similar in his most famous speech: we're all going to die someday, so don't take things too seriously. It really does lighten the load. It makes you cherish every moment, every day. And then you think, OK, let's go do something hard and scary. Alright, final question. You mentioned you're writing a novel. Most people in tech have never written anything creative or fictional. What's the most important thing you've learned in the process about writing better stories, better fiction?

Chapter 09 / 09

Chip Huyen: A lot of the time, small details trip us up as readers. So I wanted to do creative writing not just to become a better author, but because it forces me to face a different audience and gets me better at anticipating what that kind of audience wants to hear and cares about. To me, writing, or any form of content creation, is fundamentally about predicting how users will react, right?

Lenny Rachitsky: The next token.

Chip Huyen: You do a podcast!

Lenny Rachitsky: Just kidding.

Chip Huyen: Right. Podcasting is the same: you have to think about what users will find compelling, right? Many companies launch products by first telling a narrative, then figuring out how to position the product as something users will want. I've done technical writing for many years, so I can more or less guess what engineers want to hear and care about. But this time I'm facing a completely different audience, and I had no experience with that. That's why I wanted to train that muscle through creative writing, through storytelling.

That's also why I did so much research. I honestly enjoyed the process, watching lots of shows just to see what people actually like. One thing I care about a lot is something I learned from an editor: the "emotional journey". When you write, you care about how the reader feels across the whole story. The opening needs a hook to keep people reading, but you can't keep the drama maxed out either, because people get tired, emotionally fatigued, since much of the time they're being carried along by emotion. So a good story needs emotional ups and downs, perhaps climaxes, perhaps lighter stretches.
Another thing I hadn't realized before: for me, technical writing is almost entirely about content and arguments, very objective and impersonal. Take something like an ML compiler. Readers care about whether the explanation is correct, not who's telling it, because it's objective. Fiction is different: readers care whether the characters are likable.
So in the first draft of my story, the characters were extremely rational and logical; everything followed logic. A friend read it and told me: he's a good person, a great person, but honestly, he hated the character. In other words, as a story, the character was so unlikable that he didn't want to keep reading. So I wrote a second draft and made the character more endearing. One way to make a character more likable is to add a bit of vulnerability, to let them hit a wall sometimes, because that's often where we find resonance. Overall, it's fascinating. So much of it is about understanding the emotional layer: not just the story itself, but how the characters themselves feel.

Lenny Rachitsky: That's so interesting, wow. I learned even more than I expected. That's a great example. Chip, two final questions. Where can people find you online, get in touch, or work with you? And how can listeners be most helpful to you?

Chip Huyen: I'm on LinkedIn and Twitter, though I don't post much. I keep reminding myself I should post more, because I genuinely enjoy talking with readers. I'm also about to start writing on Substack; I've already set up a placeholder Substack page. I plan to write mainly about systems thinking, because I think it's a fascinating skill. I'm also considering a YouTube channel for book reviews, or rather, books that help you think better.

I think the first book I review might be this one, because it's been my favorite book since childhood and I keep rereading it. So if you'd like to help, please recommend books you love, especially ones that genuinely changed how you think or how you do things. I'd really appreciate it.

Lenny Rachitsky: Wonderful, I'm already looking forward to reading it.

Chip Huyen: Mm-hmm.

Lenny Rachitsky: Chip, thank you so much for coming on today.

Chip Huyen: Thanks for having me, Lenny.

Lenny Rachitsky: Bye, everyone. Thank you so much for listening. If you found this episode helpful, please subscribe on Apple Podcasts, Spotify, or your favorite podcast app. And please consider giving us a rating or leaving a review; it genuinely helps more listeners find the podcast. You can also find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.
