Transcript
Introduction
So let's get started. I'll be talking about building LLMs today. I think a lot of you have heard of LLMs before, but just as a quick recap: LLMs, standing for large language models, are all the chatbots that you've been hearing about recently, so ChatGPT from OpenAI, Claude from Anthropic, Gemini, Llama, and other models like this. Today we'll be talking about how they work. It's going to be an overview, because it's only one lecture and it's hard to compress everything, but hopefully I'll touch a little bit on all the components that are needed to train some of these LLMs. Also, if you have questions, please interrupt me and ask. If you have a question, most likely other people in the room or on Zoom have the same question, so please ask.

Great. So what matters when training LLMs? There are a few key components that matter. One is the architecture. As you probably all know, LLMs are neural networks, and when you think about neural networks you have to think about what architecture you're using. Another component, which is really important, is the training loss and the training algorithm, so how you train these models. Then it's data, so what you train these models on. Then the evaluation, which is how you know whether you're making progress towards the goal of LLMs. And then the systems component, which is how you make these models run on modern hardware. That is really important because these models are really large, so now more than ever systems is an important topic for LLMs. So those are the five components.

You probably all know, and if you don't, LLMs are all based on Transformers, or at least some version of Transformers. I'm not going to talk about the architecture today: one, because I gave a lecture on Transformers a few weeks ago, and two, because you can find so much information online on Transformers. I think there's much less information about the other four topics, so I really want to talk about those. Another thing to say is that
most of academia focuses on architecture and on training algorithms and losses. As academics, and I've done that for a big part of my career, we like thinking that we make new architectures and new models, and that this is very important. But in reality, honestly, what matters in practice is mostly the three other topics: data, evaluation, and systems, which is what most of industry focuses on. That's also one of the reasons why I don't want to talk too much about the architecture, because really the rest is super important.

Great, so, overview of the lecture. I'll be talking about pre-training. You've probably heard that word; this is the classical language modeling paradigm, where you train your language model to essentially model all of the internet. And then there's post-training, which is a more recent paradigm, which is taking these large language models and making them essentially AI assistants. This is more of a recent trend, since ChatGPT. So if you've ever heard of GPT-3 or GPT-2, that's really pre-training land. If you've heard of ChatGPT, which you probably have, that's really post-training land. I'll be talking about both, but I'll start with pre-training, and specifically I'll talk about what the task of pre-training LLMs is and what loss people use.

So, language modeling; this is a quick recap. Language models, at a high level, are simply models of a probability distribution over sequences of tokens, or of words. So it's some model of p(x_1, ..., x_L), where x_1 is word one and x_L is the last one in the sequence or in the sentence. Very concretely, if you have a sentence like "the mouse ate the cheese", what the language model gives you is simply the probability of this sentence being uttered by a human or being found online. If you have another sentence, "the the mouse ate cheese", here there are grammatical mistakes, so the model should have some syntactic knowledge; it should know that this has less likelihood of
appearing online. If you have another sentence, "the cheese ate the mouse", then the model should hopefully know that cheese usually doesn't eat mice. So there's some semantic knowledge, and this is less likely than the first sentence. So this is, at a high level, what language models are.

One term that you've probably been hearing a lot in the news is generative models. These are just models that can generate sentences, or generate some data. The reason why we say language models are generative models is that once you have a model of a distribution, you can simply sample from this model, and now you can generate data. So you can generate sentences using a language model.

The type of models that people are all currently using are what we call autoregressive language models. The key idea of autoregressive language models is that you take this distribution over words and you decompose it into the distribution of the first word, multiplied by the likelihood of the second word given the first word, multiplied by p of the third word given the first two words, and so on. There's no approximation here; this is just the chain rule of probability, which you hopefully all know about. Really no approximation, this is just one way of modeling a distribution. Slightly more concisely, you can write it as a product of the probabilities of the next word given everything that happened in the past, so given the context: p(x_1, ..., x_L) = product over i of p(x_i | x_1, ..., x_{i-1}). This is what we call autoregressive language models. Again, this is really not the only way of modeling a distribution. This is just one way; it has some benefits and some downsides. One downside of autoregressive language models is that when you sample from the model you have a for loop, which generates the next word, then conditions on that next word, and then generates another word. So if you have a longer sentence that you want to generate, it takes more time to generate it.
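To make the chain rule and the sampling for loop concrete, here is a minimal sketch. It is not how a real LLM is implemented: `TOY_MODEL`, `next_word_distribution`, `sequence_log_prob`, and `sample` are hypothetical names, the probabilities are made up for illustration, and the toy model conditions only on the previous word, whereas a real autoregressive model conditions on the whole context.

```python
import math
import random

# Hypothetical toy model with made-up probabilities. "<s>" marks the start of
# a sentence and "</s>" the end. Each entry is p(next word | previous word).
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"mouse": 0.5, "cheese": 0.5},
    "mouse": {"ate": 0.9, "</s>": 0.1},
    "cheese": {"ate": 0.1, "</s>": 0.9},
    "ate": {"the": 1.0},
}


def next_word_distribution(context):
    """Return p(next word | context); this toy only looks at the last word."""
    return TOY_MODEL[context[-1]]


def sequence_log_prob(words):
    """Chain rule: log p(x_1..x_L) = sum over i of log p(x_i | x_1..x_{i-1})."""
    context = ["<s>"]
    total = 0.0
    for word in words + ["</s>"]:
        p = next_word_distribution(context).get(word, 0.0)
        if p == 0.0:
            return float("-inf")  # the model considers this sequence impossible
        total += math.log(p)
        context.append(word)
    return total


def sample(rng, max_len=20):
    """Autoregressive sampling: draw a word, condition on it, and repeat."""
    context = ["<s>"]
    for _ in range(max_len):
        dist = next_word_distribution(context)
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":
            break
        context.append(word)
    return context[1:]


if __name__ == "__main__":
    likely = "the mouse ate the cheese".split()
    unlikely = "the cheese ate the mouse".split()
    print(math.exp(sequence_log_prob(likely)))    # about 0.2025
    print(math.exp(sequence_log_prob(unlikely)))  # about 0.0025
    print(sample(random.Random(0)))
```

The loop in `sample` is the downside mentioned above: each new word needs one more pass through the model, so longer sentences take proportionally more steps to generate.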
So there are some downsides to this current paradigm, but that's what we currently have, so I'm going to talk about this one.

Great. So, autoregressive language models. At a high level, the task of an autoregressive language model is simply predicting the next word, as I just said. So if you have a sentence like "She likely prefers", one potential next word might be "dogs". The way we do it is that we first tokenize: you take these words or subwords, you tokenize them, and then you give an ID to each token, so here you have 1, 2, 3. Then you pass it through this black box. As I already said, we're not going to talk about the architecture; you just pass it through a model, and you get a probability distribution over the next word, over the next token. Then you sample from this distribution, you get a new token, and then you detokenize. So you get a new ID, you detokenize, and that's how you sample from a language model. One thing which is important to note is that the last two steps are only needed during inference. When you do training, you just need to predict the most likely token, and you can compare it to the real token, the one which happened in practice, and then you change the weights of your model to increase the probability of generating that token.

Great. So, autoregressive neural language models. To be slightly more specific, still without talking about the architecture, the first thing we do is that we have all of these... Oh, sorry, yes?

"On the previous slide, when you're predicting the probability of the next tokens, does this mean that your final output vector has to be the same dimensionality as the number of tokens that you have?"

Yes.

"How do you deal with it if you're adding more tokens to your corpus or something?"

Yeah, so we're going to talk about tokenization later, so you will get some sense of this. You essentially can't deal with adding new tokens. I'm exaggerating: there are methods for doing it, but essentially people don't do it. So it's really
important to think about how you tokenize your text, and that's why we'll talk about that later. But it's a very good point to notice that the vocabulary size, so the number of tokens that you have, is essentially the output dimension of your language model, so it's pretty large.

Okay, so autoregressive neural language models. The first thing you do is that you take every word, or every token, and you embed them, so you get some vector representation for each of these tokens. You pass them through some neural network; as we said, it's a Transformer. Then you get a representation for all the words in the context, so it's a representation of the entire sentence. You pass it through a linear layer, as you just said, so that the number of outputs is the number of tokens. You then pass it through a softmax, and you get a probability distribution over the next word given every word in the context. As for the loss that you use: it's essentially a task of classifying the next token, so it's a very simple machine learning task, and you use the cross-entropy loss. You look at the actual target that happened, which is a target distribution that is a one-hot encoding. Here, in this case, the real word that happened is "cat", so that's a one-hot distribution over "cat". And here, do you see my mouse? Oh yeah, this is the distribution that you generated. You do cross-entropy, which really just increases the probability of generating "cat" and decreases the probability of generating all the other tokens.

One thing to notice is that, as you all know, this is just equivalent to maximizing the text log-likelihood, because you can rewrite the max over the probability of this autoregressive language modeling task as a minimum: I just added the log here and a minus sign, which gives the minimum of the loss, which is the cross-entropy loss. So minimizing the loss is the same thing as maximizing the likelihood of your
text. Any questions?

Okay, tokenizers. This is one thing that people usually don't talk that much about. Tokenizers are extremely important, so it's really important that you understand at least what they do at a high level. So why do we need tokenizers in the first place? First, tokens are more general than words. One simple thing that you might think is, oh, we're just going to take every word that we have and say every word is a token on its own. But then what happens if there's a typo in your word? You might not have any token associated with this word with a typo, and then you don't know how to pass this word with a typo into a large language model. So what do you do next? Also, words are fine with Latin-based languages, but if you think about a language like Thai, you won't have a simple way of tokenizing by spaces, because there are no spaces between words. So really, tokens are much more general than words. That's the first thing.

The second thing that you might think is that you might tokenize every sentence character by character. You might say "a" is one token, "b" is another token. That would work, and probably very well. The issue is that then your sequence becomes super long, and as you probably remember from the lecture on Transformers, the complexity grows quadratically with the length of sequences, so you really don't want to have a super long sequence. Tokenizers try to deal with those two problems and give common subsequences their own token. How you should think about it is that, on average, every token is around three to four letters.

There are many algorithms for tokenization. I'll just talk about one of them to give you a high-level idea, which is what we call byte pair encoding, which is pretty common, one of the two most common tokenizers. The way that you train a tokenizer is that first you start with a very large corpus of text. And here I'm really not talking about training a large language model yet; this is purely
for the tokenization step. So this is my large corpus of text, with these five words. Then you assign every character in this corpus of text a different token. So here I just split up every character into a different token, and I color-coded all of those tokens. Then what you do is that you go through your text, and every time you see pairs of tokens that are very common, you take the most common pair of tokens and you just merge them. So here you see the tokens "t" and "o" next to each other three times, so you're just going to say this is a new token. And then you continue; you repeat that. So now you have "tok", which happens three times; "toke", which happens, sorry, two times; "token", which happens twice; and then "ex", which also happens twice. So if you were to train a tokenizer on this corpus of text, which is very small, that's how you would finish with a trained tokenizer. In reality, you do it on much larger corpora of text. This is the real tokenizer of, I think, GPT-3 or ChatGPT, and here you see how it would separate these words. You see the same thing as what we gave in the previous example: "token" becomes its own token, and "tokenizer" is split up into two tokens, "token" and "izer". So yeah, that's all about tokenizers. Any questions on that? Yeah?

"How do you deal with spaces, and how do you deal with..."

Yeah, so there's a step before tokenizers, which is what we call pre-tokenizers, which is exactly what you just said. In theory, there's no reason to deal with spaces and punctuation separately. You could just say every space gets its own token, every punctuation mark gets its own token, and you can just do all the merging. The problem is that there's an efficiency question: training these tokenizers takes a long time, because you have to consider every pair of tokens. So what you end up doing, and these pre-tokenizers are very English-specific, is saying that if there's a space, we're not going to start
looking at the token that came before and the token that came afterwards. So you're not merging across spaces. But this is just a computational optimization; you could theoretically deal with it the same way as you deal with any other character. Yeah?

"When you merge tokens, do you delete the tokens that you merged away, or do you keep the smaller tokens that were merged?"

You keep the smaller tokens. In reality it doesn't matter much, because on a large corpus of text you will usually have everything, but you usually keep the small ones. The reason why you want to do that is that, as we said before, in case there are some grammatical mistakes or some typos, you still want to be able to represent these words character by character. Yes?

"Are the tokens unique? So say in this case, "token": is there only one occurrence, or do you need to keep multiple occurrences so they can take on different meanings or something?"

Oh, I see what you're saying. No, every token has its own unique ID. This is a great question. For example, if you think about "bank", which could be a bank for money or a bank by the water, it will have the same token, but the model, the Transformer, will learn, based on the words that are around it, to associate it (I'm being very handwavy here) with a representation that is either more on the "bank money" side or on the "bank water" side. But it's the Transformer that does that, not the tokenizer. Yes?

"You mentioned that during tokenization you keep the smaller tokens you started with. So if you start with a "t", you keep the "t", and then you build your tokenizer up to the point that you also have "token". So let's say maybe you didn't train on "token", but in your data you are trying to encode "token". How does the tokenizer know whether to encode it with "token" or with "t"?"

Great question. When you tokenize, so that's after training the tokenizer, when you apply the tokenizer, you always choose the largest token that
you can apply. So if you can do "token", you will never do "t"; you will always do "token". People don't usually talk that much about tokenizers, but there are a lot of computational tricks that you can do for making these things faster. And honestly, I think a lot of people think that we should just get away from tokenizers and just tokenize character by character or byte by byte. But as I said, right now there's this issue of length. Maybe one day, in five or ten years, we will have different architectures that don't scale quadratically with the length of the sequence, and maybe we'll move away from tokenizers.

"Can you share with us the drawback? Why do people want to move away from the tokenizer?"

Oh yeah. I think one good example is math. If you think about math, numbers right now are not tokenized digit by digit. So for example, 327 might have its own token, which means that when models see numbers, they don't see them the same way as we do. This is very annoying, because the reason why we can generalize with math is that we can deal with every digit separately and then do composition, where you know that adding numbers is the same thing as adding each digit separately, plus whatever you carry over. Models can't do that, so then you have to do special tokenization. One of the big changes that GPT-4 made is changing the way that they tokenize code. For example, in Python you often have these four spaces at the beginning of a line. Those were dealt with strangely before, and as a result the model couldn't really understand how to deal with code. So tokenizers matter a lot. Okay, I'll move on now, but we can come back to tokenizers later.

Great. So we talked about the task, the loss, and the tokenizer. Let's talk a little bit about evaluation. The way that LLMs are usually evaluated is using what we call perplexity. At a high level, it's just your validation loss. The slight difference with perplexity is that we use
something that is slightly more interpretable: we use the average per-token loss, and then you exponentiate it. The reason why you exponentiate it is that the loss has a log inside, and, one, humans are pretty bad at thinking in log space, but two, logs depend on the base of the log, while when you exponentiate you have everything in units of vocabulary size. And the per-token average is just so that your perplexity is independent of the length of your sequence. So perplexity is just 2 to the power of the average loss over the sequence, with the log taken in base 2.

Perplexity is between one and the size of the vocabulary of your tokenizer. One is simply because, if you perfectly predict every word, then you have a product of ones, so the best perplexity you can have is one. If you really have no idea, you predict every token with probability one divided by the size of the vocabulary, and then you do simple math and you get a perplexity equal to the size of the vocabulary. So the intuition for perplexity is that it's the number of tokens that your model is hesitating between. If your model is perfect, it doesn't hesitate; it knows exactly the word. If it really has no idea, then it hesitates between all of the vocabulary.

Perplexity has really improved. This is perplexity on a standard dataset between 2017 and 2023: it went from around 70 tokens to less than 10 tokens over these five or six years. That means that the models were previously hesitating between 70 words every time they were generating a word, and now they're hesitating between fewer than 10 words, so that's much better. Perplexity is not used anymore in academic benchmarking, mostly because it depends on the tokenizer that you use and on the actual data that people are evaluating on. But it's still very important for the development of LLMs, so when you train your own LLM, people will still really look at the perplexity.

One other common way, and now more common in academia, of evaluating these LLMs is just by taking all the classical NLP benchmarks
and, I'll give you a few examples later, just aggregating everything. So you collect as many automatically evaluable benchmarks as you can and just evaluate across all of them. Two such benchmarks are what we call HELM, which is from Stanford, and the Hugging Face Open LLM Leaderboard, which are probably the two most common ones right now. Just to give you an idea, in HELM there are all of these types of tasks, which are mostly things that can be easily evaluated, like question answering. So think about many different question answering tasks. The benefit of question answering is that you usually know what the real answer is. So the way that you evaluate these models, and I'll give you a concrete example in one second, is that you can just look at how likely the language model is to generate the real answer compared to some other answers. That's essentially, at a high level, how you evaluate these models.

To give you a specific example, MMLU is probably the most common academic benchmark for LLMs. It's just a collection of many questions and answers in all of those domains, for example college medicine, college physics, astronomy, and these types of topics. The questions are things like this one, in astronomy: "What is true for a Type Ia supernova?" Then you give four different potential answers, and you just ask the model which one is more likely. There are many different ways of doing it: either you can look at the likelihood of generating each of these answers, or you can ask the model which one is the most likely. So there are different ways that you can prompt the model, but at a high level, you know which one is correct, and the three others are mistakes. Yes?

"Given that what it's creating is unconstrained text as the output, how do you evaluate a model if it gives something that's semantically completely identical but is not the exact token list that you expect?"

Yeah, that's a great question. I'll talk more about that later. Here, in this case, we don't do unconstrained generation. So the way you would
evaluate MMLU is either you ask the question and then you look at the likelihood of the model generating A, the likelihood of the model generating B, C, and D, and you look at which one is the most likely; or you can ask the model, "out of A, B, C, D, which one is the most likely?", and you look at whether the most likely next token is A, B, C, or D. So you constrain the model to say it can only answer these four things.

"When you say you constrain the model, do you mean you constrain the prompt, or do you mean that, out of its whole probability distribution over outputs, you're only comparing those?"

In the second case I gave you, you would do both: you would prompt the model saying "A, B, C, or D", plus you would constrain it to only look at these four tokens. In the first case, you don't even need to generate anything. Given that it's a language model, it can give a distribution over sentences, so you just look at what the likelihood is of generating all of these words, what the likelihood is of generating the second choice, and so on, and you look at whether the most likely sentence is the real answer. So you don't sample from it; you really just use p(x_1, ..., x_L). Does that make sense? That being said, evaluation of open-ended questions is something we're going to talk about later, and it's really important and really challenging. Yes?

"Earlier you mentioned that metrics like perplexity are not usually used because they depend on how you do your tokenization and on some design choices. I was wondering if you could speak more to that."

Oh yeah. So think about perplexity. I told you perplexity is between one and the vocabulary size. Now imagine that ChatGPT uses a tokenizer that has 10,000 tokens, but Gemini from Google uses a tokenizer that has 100,000 potential tokens. Then the upper bound of the perplexity that you can get is worse for Gemini than for ChatGPT. Does that make sense? So that's just an idea. It's a little bit
more complicated than that, but that's just one first-order way you can see that the tokenizer matters.

Great. Okay, so evaluation challenges. There are many; I'll just talk about two, really briefly. One: as I told you, there are two ways of doing evaluation for MMLU. There are many more than two, but I gave you two examples. It happens that for a long time, even though that was a very classical benchmark that everyone used, different companies and different organizations were using different ways of evaluating MMLU, and as a result you get completely different results. For example, LLaMA 65B, which was the first model from Meta in the LLaMA series, had 63.7 accuracy on HELM, but on this other benchmark it had 48.8. So it really depends on the way that you evaluate, and this is not even talking about prompting; this is really just the way that you evaluate the models. Prompting is another issue. So really, there are a lot of inconsistencies. It's not as easy as it looks. That's the first thing. Yeah, sorry?

"How can we make sure that all these models aren't trained on the benchmark?"

Okay, second thing. This is a great question: train-test contamination. This is something which I would say is really important in academia. Given that the talk is mostly about training large language models, for companies it's maybe not that important, because they know what they trained on. For us, we have no idea, so for us it's a real problem. There are many different ways of trying to test whether the test set was in the training set. One cute trick that people in the lab, in Tatsu's lab, have found is this: given that most of the datasets online are not randomized, and given that what language models do is just predict the next word, you can look at the entire test set and compare the likelihood of generating all the examples in order versus all the examples in a different order. If the model is more likely to generate them in order, given that there's no real
order there, then it means that the test set was probably in the training set. Does that make sense? So that's one of them; there are many other ways of doing it. Train-test contamination, again: not that important for development, really important for academic benchmarking. Great. There are many other challenges, but I'll move on for now.

Great: data. Data is another really big topic. At a high level, people just say, oh, you train large language models on all of the internet. What does that even mean? People sometimes say "all of the clean internet", which is even less defined. The internet is very dirty and really not representative of what we want in practice. If I download a random website now, you would be shocked at what is in there. It's definitely not your Wikipedia. So I'll go really briefly over what people do. I can answer some questions, but data on its own is a huge topic.

First, what you do is download all of the internet. What that means is that you use web crawlers that will go to every web page on the internet, or every web page that is on Google. That is around 250 billion pages now, and that's around one petabyte of data. Common Crawl is one web crawler. People don't usually write their own web crawlers; what they do is use standard web crawlers, and Common Crawl is one of them, which every month adds all the new websites that were added to the internet and that are found by Google, and puts them in a big dataset. So in Common Crawl you have around 250 billion pages now, so about 1e6 gigabytes of data.

Once you have this... so this is a random web page, really random, from this Common Crawl. What you see is that, one, it really doesn't look like the type of thing that you would usually see. This is an HTML page. It's hard to see, but if you look through it you will see some content. For example, here: "tesing world is your ultimate source for the system X high performance server", and then you have three dots, so the sentence is not even finished. That's how a random
internet page looks. So of course it's not that useful if you just train a large language model to generate things like this. So what are some of the steps that are needed?

First, you extract the text from the HTML. That's what I just tried to do by looking for the correct text. There are a lot of challenges with this. For example, extracting math is very complicated but pretty important for training large language models. Or, for example, boilerplate: a lot of forums will have the same type of headers and the same type of footers, and you don't want to repeat all of this in your data.

Then you filter undesirable content: not-safe-for-work content, harmful content, PII. Usually every company has a blacklist of websites that they don't want to train their models on. That blacklist is very long, and you say, if it comes from there, we don't train on it. There are other ways of doing these things: you can train a small model for classifying what is PII and remove those things. It's hard. Every point here that I'm going to show you is a huge amount of work, but I'm going to go quickly through it. So, filter undesirable content.

Next is deduplication. As I said, you might have things like headers and footers in forums that are always the same, and you want to remove that. Another thing that you might have is a lot of URLs that are different but show the same website. And you might also have a lot of paragraphs that come from common books that are duplicated a thousand times or ten thousand times on the internet. So you have to deduplicate. That's also very challenging, because you have to do it at scale.

Once you've done deduplication, you do some heuristic filtering: you try to remove low-quality documents. The way you do that is with things like rule-based filtering. For example, if you see that there are some outlier tokens, so if the distribution of tokens in the website is very different from the usual distribution of tokens, then it's probably some outlier. If you see that the length of the words in this website
is super long, there's something strange going on on that website. If you see that the website has only three words, is it worth training on? Maybe not. If it has 10 million words, maybe there's also something wrong going on on that page. So a lot of rules like this. Yes?

"Why do we filter out undesirable content from our dataset instead of putting it in as a supervised loss? Can we not just say, here's this hate speech website, let's actively penalize the model for generating it?"

We'll do exactly that, but not at this step. That's where post-training comes in. In pre-training, the idea is just to say, I want to model how humans speak, essentially, and I want to remove all these headers, footers, menus, and things like this. But it's a very good idea that you just had, and that's exactly what we'll do later.

Next step: model-based filtering. Once you've filtered a lot of data, what you do, and this is a very cute trick, is take all of Wikipedia and look at all the links that are linked from Wikipedia pages, because if something is referenced by Wikipedia, it's probably some high-quality website. You train a classifier to predict whether a document comes from one of these Wikipedia references or whether it's from the random web, and you try to say, I want more of the things that come from Wikipedia references. Does that make sense? So you train a machine learning model, usually also a very simple model, because you need to do that really at scale. Just think about the 250 billion pages.

Next, you try to classify your data into different domains. You say, okay, this is entertainment, this is books, this is code, these types of domains. Then you try to either upweight or downweight some of the domains. For example, you might see that if you train more on code, then your model becomes better at reasoning. That's something that people usually say in a very handwavy way: if you train
your model more on code, it helps reasoning. So you want to upweight the coding distribution, because that helps general language modeling skills. Books is usually another one that people upweight; entertainment they usually downweight. So things like this. People used to do it maybe heuristically; now there are entire pipelines, which we'll talk about, for doing these things slightly more automatically.

Then, at the end of training, after training on all of this data that we saw, you usually train on very high-quality data, while you decrease your learning rate. That means that you're overfitting your model on very high-quality data. So usually what you do there is Wikipedia, you overfit on Wikipedia, and you overfit on human data that was collected. The other things, like continual pre-training for getting longer context, I'm going to skip over. But just to give you a sense of how hard it is: when people just say, oh, I'm going to train on the internet, that's a lot of work, and really we haven't figured it out yet. So collecting data well is a huge part of practical large language models. Some might say it's the key. Yes?

"A basic question about data. Usually, when you start with the terabyte of data, after you go through all those steps, what's the typical amount of data you have remaining? And how large a team does it typically take to go through all the steps you talked about?"

So is the question how large the data is after you filter?

"Yeah, after you filter. And then, to go through all the filtration steps, how slow is it, or how many people would you need to be able to do this?"

Okay, that's a great question. I'm going to somewhat answer the one about data, how large the dataset is, at the end of this slide. For the number of people that work on it, that's a good question. I'm not quite sure. I don't quite know, but I
would say it's probably even bigger than the number of people that work on the tuning of the pre-training of the model. So the data side is bigger than the modeling side. I don't think I have a good sense. I would say that on Llama's team, which has around 70 people, maybe 15 work on data. For all these things you don't need that many people; you need a lot of compute, because for data you need a lot of CPUs. And I'll answer the second question at the end of this slide.

As I just alluded to, we really haven't solved data at all for pre-training, so there's a lot of research still to be done. First, how do you process these things super efficiently? Second, how do you balance all of these different domains? Can you do synthetic data generation? That's a big one now, because, and we'll talk about this later, we don't have enough data on the internet. Can you use multimodal data instead of just text data, and how does that improve even your text performance?

There's a lot of secrecy, because really this is the key to most pre-trained large language models. For competitive dynamics, these companies usually don't talk about how they do their data collection. There's also a copyright liability issue: they definitely don't want to tell you that they've trained on books, even though they did, because otherwise you can sue them.

Common academic benchmarks; this will answer what you asked. It started with the smaller ones. The names are not that important, but it started from around 150 billion tokens, which is around 800 GB of data. Now it's around 15 trillion tokens, which is also roughly the amount of data the best models are probably trained on. 15 trillion tokens is, I guess, two orders of magnitude bigger than that, so about 80e3 GB. That would be around a 100 to 1,000 times filtering of Common Crawl, if I'm not mistaken. One very famous one
is the Pile. This is an academic benchmark, and we can look at what distribution of data it has. It's things like arXiv; PubMed Central, which is all the biology stuff; here it's Wikipedia; you see Stack Exchange, some GitHub, some books, and things like this. Again, this is on the smaller side: if we look here, this is 280B, so in reality it's 100 times bigger, and you cannot have that large a share of GitHub and Wikipedia.

In terms of closed-source models, just to give you an idea: Llama 2 was trained on 2 trillion tokens; Llama 3 on 15 trillion tokens, which is currently the best model for which we know how much it was trained on, and which is the same as the biggest academic benchmark, 15 trillion tokens. For GPT-4 we don't really know, but it's probably in the same order of magnitude; it's probably around 13 trillion, from leaks, if the leaks are true.

Great, so scaling laws. Any other questions on data before we go to scaling laws? Sorry, I know I'm giving you a lot of information, but there's a lot that goes into training large language models.

Great, scaling laws. The idea is that what people saw around 2020, or at least had seen for a long time but have been able to show theoretically, or rather empirically, since 2020, is that the more data you train your models on and the larger the models, the better the performance. This is pretty different from what you've seen in this class. In this class we teach you about overfitting. Overfitting doesn't happen with large language models: larger models, better performance. It's something that really took a long time for the community, who took this type of class, to realize. But for the exam, overfitting exists.

OK, so the idea of scaling laws is: given that more data and larger models will always give you better performance, can we predict how much better your performance will be if you increase the amount of data and the size of your model? And surprisingly, it works. Here you see three plots
from a very famous paper on scaling laws from OpenAI. On the x-axis you see compute, so how much compute you spent on training, and on the y-axis you see test loss. This is essentially, not perplexity, but your validation loss, so it's the log of the perplexity. If you put these two on a log scale, you see that the scaling law is linear. That means that if you increase your compute by a certain amount, you can say by how much your test loss will decrease. Same thing with data, and same thing with parameters: if you increase the dataset size, your loss will decrease by an amount that is somewhat predictable; if you increase the number of parameters, the loss will decrease by an amount that is somewhat predictable. This is really amazing, very surprising. It looks innocuous when you look at these types of plots, but it's crazy, because it means you can predict how well we're going to perform in two or three years depending on how much compute we add, assuming these things hold. There's nothing theoretical about it. Yes?

Two things. One, what is the loss they're using here? Is this perplexity?

So, I said perplexity is two to the power of the loss, so this is the log of the perplexity.

And the second thing: when you increase the number of parameters, or you increase the total dataset size, doesn't that just inherently increase your compute? [inaudible]

No, this is a great question. The compute here is a function of two things, the data and the parameters. What I'm showing here is that, well, we're going to talk about that in detail, but if you increase the number of parameters, you should increase the amount of data that you have. So you don't go multiple times through the same dataset. No one does multiple epochs, at least not yet, because we still have enough data. So yeah, this is all the same trend, which is: increase
compute, decrease loss. Yes?

Have we seen the numbers for the last two years, or is it still holding?

It is still holding. I don't have good numbers to show you, but it is still holding, surprisingly. Yes?

Is there no empirical evidence that you plateau, as you'd expect?

No empirical evidence of plateauing anytime soon. Why? We don't know. Will it happen? Probably. It doesn't need to, because it's on a log scale, so it's not as if it had to plateau mathematically; it could continue decreasing like this. Most people think it will probably plateau at some point. We don't know when.

OK, so I'll talk more about scaling laws now. Why are scaling laws really cool? Imagine you're very fortunate and I give you 10,000 GPUs for this month. What model will you train? How do you even go about answering that question? This is a hypothetical, but it's exactly what these companies are faced with.

The old pipeline was that you tune hyperparameters on the big models. Let's say I have 30 days: I train 30 models for one day each, I pick the best one, and that's the final model I use in production. That means the model I end up using was only trained for one day.

The new pipeline is that you first find a scaling recipe. For example, one common thing is that if you increase the size of your model, you should decrease your learning rate. So you find a scaling recipe such that, if I increase the size of my model, here's what I should do with some hyperparameters. Then you tune your hyperparameters on smaller models of different sizes. Let's say that for 3 of my 30 days I train many different models and do hyperparameter tuning on these small models, each of a different size. Then I fit a scaling law and try to extrapolate from these smaller models which one will be the best if I train it for much longer, or sorry, if I train a larger model. And then I train the final
huge model for 27 days instead of just one day. So the new pipeline is: don't train things or do hyperparameter tuning at the real scale of the model you're going to use in practice, but do things on smaller ones at different scales, and try to predict how well they will perform once you make them bigger.

I'll give you a very concrete example now. Let's say Transformers versus LSTMs. You have these 10,000 GPUs and you're not sure which one you should be using: a Transformer-based model or an LSTM-based model? What I do is train Transformers at different scales. Here you see the number of parameters on the x-axis; the y-axis is my test loss. I then train different LSTMs at different scales. Once I have these points, I see that, oh, it fits a scaling law. I fit my scaling law, and then I can predict: if I had 10 times more compute, here's how well I would perform. For the LSTM it's slightly less linear, but you could probably still try to predict where you would end up. And clearly from this plot you would see that Transformers are better.

One thing to notice when you read these types of scaling laws is that there are two things that are important. One is your scaling rate, which is the slope of the scaling law. The other is your intercept: you could start worse but become better over time. It just happens that LSTMs are worse on both. But I could show you another plot where you can predict that after a certain scale you're better off using one type of model than another. That's why scaling laws are really useful. Any questions on that? Yeah?

How sensitive are these to small differences in the architecture? With one Transformer architecture versus another Transformer architecture, do you have to fit your own curve and say, scaling laws tell me there should be some logarithmic function, let me extrapolate that for my own setup?

Yeah. So usually, for example, if you're an academic and you
want to, now at least, that's pretty recent, propose a new activation, that's exactly what you do: you fit a scaling law, show another scaling law with the standard, I don't know, GELU, and you say that yours is better. In reality, once you start thinking about it in scaling-law terms, you realize that all the architecture differences we can make, the small minor ones, maybe change the intercept a little bit, but really that doesn't matter, because you just train for 10 hours longer or wait for the next GPUs. These things are really secondary, which is exactly why I was telling you originally that people spend too much time on the architecture and losses. In reality these things don't matter as much. Data, though: if you use good data, you will have much better scaling laws than if you use bad data. So that really matters.

Another really cool thing you can do with scaling laws is ask yourself how to optimally allocate training resources. Should I train larger models? We saw that it's better when you train larger models, but we also saw that it's better when you use more data. So which one should I do: train a smaller model on more data, or a larger model on less data? Chinchilla is a very famous paper that first showed this. I want to give you a little bit of a sense of what these plots are. Here you see training loss again; on the x-axis you see the number of parameters, so the size of the model. All these curves are what we call IsoFLOPs, meaning that all the models on a given curve have been trained with the same amount of compute. The way you do that is you vary the number of tokens you train on and the size of the models, but in such a way that the total compute is constant. So all these curves that you see with different colors have different amounts of compute
that they were trained with. Then you take the best one for each of those curves. Once you have the best one for each curve, you can plot how many FLOPs it was, so which curve you were on, against how many parameters you used for that specific point. You put that on a log scale again, and now you fit a scaling law again. So now I have something which tells me: if I want to train a model with 10^23 FLOPs, here's exactly the number of parameters I should be using, 100B. And you can do the same thing with FLOPs and tokens. So now you can predict: if you tell me "I have one month of compute, what size of model should I be training?", I fit a scaling law and I tell you.

Of course, that all looks beautiful. In reality there are a lot of small things, like whether you should be counting embedding parameters. There are a lot of complexities, but if you do things well, these things do hold. The optimal ratio that the Chinchilla paper found is to use 20 tokens for every parameter that you train. So if you add one more parameter, you should train your model on 20 more tokens.

One caveat here is that this is optimal for training resources. That is telling me: if you have 10^23 FLOPs, or, I don't know how much that is, $100 million or 10, no, that's much less, let's say I have $5 million to train my best model, the one that gets the lowest loss, what would I train? In reality these companies need to think about inference too. If you have a smaller model, you spend less over time. So if you consider the inference cost, there are other papers that tried to show that it's around 150 tokens per parameter, because you prefer having a smaller model, since over time you're going to spend less money on inference. So 150 to 1: that's around what the best models are trained on now, at least the ones that are used in production. Great, any questions on
Chinchilla? Great. Oh, sorry?

In practice, how expensive is inference for these models relative to training?

Very expensive. I won't talk about inference, because that would be another entire lecture, but just think about ChatGPT, where they have, I don't know how much it is now, 600 million people that use it. That's a lot. So it's very expensive. There's a lot of optimization you can do for inference though, and that's an entire other lecture, so I'm going to skip it this time, but it's very interesting.

OK, tuning. As I said, there are many things you can answer with scaling laws. I just tried to give you two examples, but really there are many: what data you use, what data mixture weighting you use, which is what we talked about before, what architecture you use, whether you should make your models wider or deeper, whether you should be paying for more GPUs or collecting more data. All of these are things you can try to answer with scaling laws.

One thing I want to mention is the Bitter Lesson, if you've ever heard of it, a very famous blog post by Richard Sutton from 2019. What he realized, which I think not enough people realize, and I definitely did not realize at the time, is that once you see these types of scaling laws, you know that the more compute you have, the better the models you will get, so with scale you get better models. And you also know, by Moore's law or these types of variants of Moore's law, that you will always have better compute. Then the only thing that matters is to have architectures that can leverage computation. So what matters is systems and data, and less so the architecture, the small architecture differences like your activation and things like this. I think that's one of the reasons why most research focuses on some things that matter less for industry, and I was one of those researchers for a large part of my career. So don't spend time overcomplicating. Do the simple things, do them well, and scale them. That's really what OpenAI taught us with ChatGPT and with all the GPTs before. OK, I
want to give you some back-of-the-envelope computations. I might be off by a few factors here, but I just want to give you a sense of how costly it is to train some of these models. I'll take as an example Llama 3 405B, which is currently the best open-source model you can get. It was trained on 15.6 trillion tokens and it has 405 billion parameters. So now that you know about the optimal tokens per parameter: that's around 40. That's a little bit more than Chinchilla but less than this inference-optimal ratio, so they went for training optimality.

FLOPs for this model: one simple way to compute FLOPs is 6 times the number of parameters times the number of tokens you train on. If you do this simple calculation, it's 3.8e25 FLOPs. The reason this is important is that, if you follow the news a little bit, there's an executive order from Biden that says that once you have 1e26 parameters, sorry, FLOPs, you get special scrutiny on your models. So they went about 2x below that; they really went below this to not have special scrutiny. So 3.8e25; I might be off by a little bit, but it's definitely under 1e26.

Oh, so P is parameters, N is data, the number of tokens. This is just an approximation.

OK, compute. We know that they trained on 16,000 H100s, and we know the throughput, they stated that too. So if you do the computation, it takes around 70 days, or 26 million GPU hours. At least that's with my back-of-the-envelope computation; they said that they used 30 million instead of 26 million GPU hours, so maybe they had some challenges, I don't really know. But if you follow the simple computation, it's around 70 days.

Cost: it's hard to approximate, but I'm just going to treat it as rent. What if I were to rent that many H100s for that many days, how much would I pay? A lower bound on the rental cost of an H100 is around $2 per hour. If you multiply this by 26 million hours, you get 52 million dollars. So they probably paid less than that
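The arithmetic above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the function name is hypothetical, and the per-GPU throughput of 400 TFLOP/s is an assumption (the talk refers to a known throughput without giving the number here), chosen because it lands near the quoted 70 days. All other inputs are the figures quoted above, including the staffing figures (50 employees at 500k per year).

```python
def training_cost_estimate(
    params=405e9,                 # model parameters (quoted in the talk)
    tokens=15.6e12,               # training tokens (quoted in the talk)
    n_gpus=16_000,                # number of H100s (quoted in the talk)
    flops_per_gpu_per_s=400e12,   # ASSUMPTION: sustained throughput per GPU
    dollars_per_gpu_hour=2.0,     # rental lower bound (quoted in the talk)
    employees=50,                 # headcount used in the talk's estimate
    salary_per_year=500_000,      # salary used in the talk's estimate
):
    """Back-of-the-envelope cost of one training run."""
    train_flops = 6 * params * tokens            # FLOPs ~ 6 * P * N
    seconds = train_flops / (n_gpus * flops_per_gpu_per_s)
    gpu_hours = n_gpus * seconds / 3600
    rental_dollars = gpu_hours * dollars_per_gpu_hour
    salary_dollars = employees * salary_per_year
    return {
        "tokens_per_param": tokens / params,
        "train_flops": train_flops,
        "days": seconds / 86_400,
        "gpu_hours": gpu_hours,
        "rental_dollars": rental_dollars,
        "total_dollars": rental_dollars + salary_dollars,
    }


if __name__ == "__main__":
    for name, value in training_cost_estimate().items():
        print(f"{name}: {value:.3g}")
```

With these inputs the sketch gives roughly 3.8e25 FLOPs, just under 70 days, and about 26 million GPU hours. Changing the assumed throughput rescales the days and GPU hours proportionally, which is one way the reported 30 million GPU hours could differ from this estimate.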
but not much less, because all these services that rent out GPUs don't make that much money. So it's probably slightly less, but not that much less. Now salaries: I said 50 employees at 500k per year, that's probably the ballpark, so 25 million. If you put it all together, it's around 75 million dollars for training this Llama model. I'm probably off by 10 million or so, but that's the ballpark.

Carbon emitted: a lot of people might ask about it, since cost is not the only thing that is important. I did the computation, and it's around 4,000 tons of CO2 equivalent. That is only about 2,000 return tickets from JFK to London. So right now the carbon emitted is, I mean, it's huge, but it's not meaningful yet. I think with maybe GPT-6 or GPT-7, once you multiply this by 100, that might become a real issue. Right now it's still not, I think, an issue in the grand scheme of things.

Next models: the way you should think about these models is that with every new generation, the number of FLOPs essentially multiplies by 10x, or at least that's what they try to do, if they have enough energy and if they can buy enough GPUs. Great, any questions on this back-of-the-envelope math? No?

OK, so now we've talked about pre-training. I also wanted to chat about systems, because we now know compute is really important, so there's a question of how you optimize your compute. I'll leave that for the end, because I'm not sure how much time we'll have. I think it's important, and hopefully I'll be able to talk about it later, but it's slightly different from what we've been talking about so far. So I'll move on to post-training for now.

The reason we need to do post-training is, as I told you before, to make AI assistants. Language modeling is not really what you want when you have an AI assistant. For example, if you ask GPT-3, which is a pure language model, not an aligned one, a question like "explain the moon landing to a six-year-old", the completion that you would
get is something like "explain the theory of gravity to a six-year-old", because what it learned is that on the internet, if you have one question, you usually have maybe another bullet point with other similar questions. You don't usually have a question and then the answer after it. This is not what you want from an AI assistant. So how do we do this alignment, which is this post-training, and make these models into assistants?

The goal of alignment is to get LLMs to follow the instructions given by users, and maybe some designers' desires. Think about moderation: OpenAI definitely doesn't want the model to say stuff that is very toxic. Here you see on the left-hand side that when you ask a question, it provides a real answer, so it's not like the LLM before. And on the right-hand side you see that if you ask it to write a tweet describing how a certain part of the population is evil, it will say that it cannot do that. So that's alignment.

The background here is that, for the data you want for training these models, we know what we want, which is just asking humans: this is a question, this is the answer that you want. But the thing is that it's very expensive to collect that data, and it's hard to find it online. In contrast, pre-training data is not what you want, but there's a lot of it. So the main idea is simply to take a pre-trained large language model, pre-trained on all of the internet, and then just fine-tune it, so you change the weights a little bit on the type of data that you actually want. And hopefully, given that you've already pre-trained it on all of the internet, it already knows how to speak English and knows standard language syntax, so you can really fine-tune it with very little data.

OK, SFT. Supervised fine-tuning is really exactly what I just said: the idea of fine-tuning the large language model on desired answers that are collected from humans. Why is it called supervised fine-tuning? Because you want to do language modeling
on the real answers. Language modeling is this next-word prediction, and that's the fine-tuning part, and you want to do it on desired answers given by humans, so that's why we call it supervised.

So how do we collect this data? Well, I just said it: you just ask humans to tell you, this is a question, this is the answer you would want from some of these models. Here's an example. Sorry, I can't read very well on my computer, but "my kid needs to do a science...", no, let's read this one: "Can you write a short introduction about the relevance of the term monopsony?" And then it says "Monopsony refers to a market structure..." blah blah blah, and a human wrote that. This is from Open Assistant, which was a way to collect data online from humans.

This type of supervised fine-tuning, or alignment, is really the key to ChatGPT. This is what made the big jump from GPT-3, which was mostly something known by AI researchers, to ChatGPT, which became known by everyone.

The problem with human data is that it's very slow to collect and very expensive. So one simple idea is to use LLMs to scale data collection. That's exactly what we did with Alpaca one year ago. We used a dataset of human-written question-answer pairs, there were 175 of them, and we asked the best model at the time, text-davinci-003, to generate many more of these questions and answers. So all we did is say: this is what humans would write, now write similar questions and similar answers. We collected 52,000 LLM-generated question-answer pairs. Then we simply took Llama 7B, which was the best pre-trained model at the time, and fine-tuned it with supervised fine-tuning, as I told you. That's how we got the Alpaca 7B model.

This is the type of data we collected. Things like "What does algorithm mean?" "An algorithm is a step-by-step set of instructions used to solve a problem or achieve a goal..." blah blah blah. So the data is, well, it's pretty good
given that it was generated by LLMs from essentially two generations ago. That really started, at least for us, as an academic replication of ChatGPT. Now there's a big field of synthetic data generation, about how to use LLMs to make the development of LLMs faster, basically by decreasing the number of human hours you need.

Quantity of data. We talked about what type of data we use and how we collect it. One thing that is surprising with SFT is that you don't need that much data. What this paper, called LIMA, showed is that if you scale the amount of data you use for supervised fine-tuning from 2,000 to 32,000 examples, it really doesn't help much. So here scaling laws definitely don't help. The intuition is that all you learn is how to format your desired answers. Another way of saying it is that your pre-trained model essentially models the distribution of every user on the internet: one that might write bullet points, another that might answer a question with an answer. So all you tell your model is: wait, you should be optimizing more for this type of user than that one. You're not actually teaching it anything through this SFT; all you do is tell the model to optimize for one type of user that it already saw in the pre-training dataset. So the knowledge is already in the pre-trained LLM, and you just specialize it to one type of user. Great, any questions on SFT? Yes?

I know it's a big issue with synthetic data that if you keep generating data from the same distribution, eventually you're not learning a new distribution, you're essentially just bootstrapping. Surely you can't scale that forever; you can't keep generating from the same distribution and hope to learn something new. It's an active area of research, but do you have any thoughts on how people are thinking about this and better ways to bootstrap? Or should we give up on this
idea and accept that, since the chart shows you don't need that many, you just get humans to generate 2,000 really good examples?

Yeah, that's a very good question. For the data stuff, I'm saying it's not that important for SFT, but there will be another thing we'll talk about afterwards where data does matter. My intuition, based on not that many empirical results, is that if you use purely LLM-generated text and you do that for three or four generations of LLMs, I agree with you that you probably won't improve much. But for me, what is important is how you use humans in the loop with LLMs: not purely LLMs, not purely humans. Maybe what you can do is have the model generate some new text and have humans write a few edits. Edits are much faster than writing the entire text, and I think that with that type of collaboration, from an information-theoretic point of view, you still get additional information, but you're still much faster than if you only use humans. I think that as a field we'll probably move towards these types of things, which is really just finding the examples that are important and asking humans. It's like active learning: asking humans exactly when you need to get their input. Yes?

Do we train with the same loss function and the same general training algorithm for the supervised fine-tuning bit as we do for pre-training? Because in the examples you showed, I think the important thing about the good examples is that they're super accurate. [inaudible]

Yes, the same. I maybe didn't emphasize this enough: this is just language modeling. You fine-tune the LLM with language modeling on the desired answers. So this is the same loss. It will be different in two seconds, but the first step, SFT, is the same loss, where you just say: OK, I want to specialize on that type of data. So there's even a question of what is pre-training and what is post-training, because in reality it's just different data that you use. The reason
why we usually call it post-training is that the way we collect that data is very different. Great questions. Yes?

Maybe it's the same question, but why would these 2,000 examples have such an outsized influence on fine-tuning?

That's another reason why we call it post-training: we use different hyperparameters. I told you that at the end of pre-training you essentially end up with a learning rate of zero, and here you're going to increase your learning rate again, so 1e-5, 1e... yeah. So the weight that you give to these examples is different.

OK, the second step, or second part, of this post-training is what we call reinforcement learning from human feedback, or RLHF. Some of you might have heard of it. The idea is that SFT has a problem, namely that you do behavioral cloning, which means that you just try to clone what the humans would say. That has many issues.

One of them is that you're bound by human abilities. Humans won't necessarily generate the things that they think are the best things to generate. If you ask me to write a book, well, I can definitely enjoy a book, I can probably say one book is better than another, but I'm definitely not going to be good enough to write the book that I want to read. So you're going to be bound by the human ability to generate things, even though humans might be better at distinguishing between things. That's one issue.

Issue number two, which I find pretty interesting, is hallucination, if you've ever heard that word. This is LLMs generating false information. People have hypothesized that hallucination can come from supervised fine-tuning, even if you do supervised fine-tuning on data that is correct. The reason is that, as I told you, SFT uses very little data, and it's data from which the model doesn't learn anything new. So what if the human gives an answer that the model didn't know was true? From the model's perspective, the human is telling the model: generate
this thing that seems plausible, but the model has no idea if it's true or not. To give you a very concrete example, let's go back to this monopsony example, "Can you write blah blah about monopsony". Imagine that the human cited a reference to some book in the answer. That book might exist, and that might be a correct reference. But what if the LLM never saw this reference during pre-training? Then it doesn't know that it's a correct reference. So really what you teach the model is to make up some plausible-sounding reference, rather than to give a real reference that it saw during pre-training. So hallucination might be caused by this SFT. That's problem number two. Does that all make sense? Great.

Problem number three: price. Generating the ideal answers is very pricey, and that comes back to your question: humans writing answers is pretty expensive.

So that's where RLHF comes in. The idea is that instead of cloning the behavior of humans, we're going to maximize human preference. The pipeline is that for every instruction, you ask a model to generate two answers. You usually use a pretty good model here, so not a base LLM but an LLM that has already been fine-tuned with SFT, so that it gives pretty good answers. Then you ask labelers which of these two answers is better, so they select the preferred one. And then, with different types of algorithms, which we're going to talk about, you fine-tune the model to generate more of the green thing than the red thing, so more of the good stuff.

Now the question is how, and we're going to talk about that now. There are two ways that we're going to talk about, the two that are mainly used in the community. The first one is simply the idea of using reinforcement learning. Hopefully you all know what reinforcement learning is by now. When you think about using reinforcement learning, one important question is: what is the reward that we're optimizing? So in
this case there are really two options that I can think of. The first one: you could just compare the output generated by some baseline with the output generated by my model, and ask a human to say which one is better, and use this as the reward. So if I'm better than the baseline, it's a plus one; if not, it's a minus one. So now it's a binary reward. The problem with a binary reward is that it's very sparse and you don't get much information out of it. Maybe your answer was slightly better, maybe it was way better, and you don't really know from this how much better it was.

Option two is that you train what we call a reward model, which is simply a classifier. So you use machine learning to classify how much better one of two outputs is from the perspective of the human. This is a little bit meta, but what you do is take a reward model R, which is also a large classifier, and you give this reward model the input and one of the two outputs, and you exponentiate the reward it returns. That's the softmax loss that you all know about: you divide by the sum of the exponentiated reward on the first output and the exponentiated reward on the second output. The reason you do that is that you train this reward model to be able to classify how much better one output is than the other. Another, slightly less convoluted way of saying it is that your reward model outputs some reward that is used as the logits of your softmax. So if an output has high logits in your softmax, it means this output is highly likely to be the better one. That's what we call the Bradley-Terry model. Yes?

Is this reward model going over the entire output, or is it going [inaudible]?

Yeah, this takes the entire output at once. It takes all the input and all the output, and it gives one number. Yes?

Where would the human be? Sorry, with
the reward model, where would a human be? Oh, I see. OK, sorry, maybe I wasn't clear. You train this reward model to fit this green and red preference from humans. So you train a classifier to say whether the humans prefer red or green, but instead of using the binary reward, which is what the human would tell you, you use the logits of the softmax. And the thing with the logits is that logits are continuous, so if your reward model says it has high logits, then in some ways the human highly prefers this answer to some other answer. Great. So, as I just said, it's continuous information, so it's better. That's what people use in practice, or at least used to use in practice; I'll tell you about the other algorithm later. What you do at the end is just use reinforcement learning as you know it. Now we have a reward; what you sample through is the generation from your large language model, and then you use some regularization term. The reason why you have this regularization term is to avoid what we call over-optimization. This reward model might not perfectly model human preferences, so you don't want to maximize this thing to essentially infinity. And you do it using PPO, which is a common reinforcement learning algorithm. One thing to note here, because it will be important for later, is that now the large language model is a policy for your reinforcement learning. It's not maximizing likelihood anymore, which means that you're not modeling any distribution anymore. The reason why this is important is that models that went through this type of PPO don't give you likelihoods of text that are meaningful, because what you optimized them to do is basically just generate the most likely thing, not model all the answers that humans might say. Another way of saying that is that there's nothing here that incentivizes the model to not give a single possible generation.
Nothing here says it's good if you have some distribution with some entropy. OK, if you haven't followed, it's not that important, but it's good to know. Great. So PPO is exactly what ChatGPT did originally. Here, from their blog post, what they have is: step one, do supervised fine-tuning, which now you all know about; step two, train a reward model on human preferences; step three, do PPO for multiple steps, which is where you see this blue arrow. So you train the model once with PPO, you collect new data, you continue. That's exactly what ChatGPT did, and that was the big breakthrough between GPT-3 and ChatGPT. One thing to note is that PPO has many challenges. Reinforcement learning is something that's super nice theoretically; in practice, anyone who has ever worked with reinforcement learning knows it's such a mess. There are a lot of things: rollouts, outer loops, clipping, so many complications. So it's messy. This is the idealized PPO used for LLM settings, so that's already much more complicated than the expectation we saw before, and in practice it's even more complicated. We have one implementation of it that we had to do, and I'm not going to go through it, but there's so much stuff that you have to think about when you implement that type of PPO algorithm. You have clipping everywhere, you have a lot of complexities, and things are not well documented. All this to say that there was a new method proposed, also from Stanford, one year ago, called DPO, which is essentially a simplification of PPO. The idea is that instead of using reinforcement learning, you can just maximize the probability of generating the stuff that you like and minimize the probability of the stuff that you don't. So if you think about the human preferences, the red and green: maximize green, minimize red. So the loss is this one, where what you see is simply some log-probability under the model, so this is the likelihood of the model generating the things
that the human preferred given the inputs. What you try to do is maximize the likelihood of generating the things that you like and minimize the likelihood of the things that you don't. All the rest of the terms here are not too important. It's really not that complicated to understand, but at a high level it's really just maximizing the things you like and minimizing the rest. One thing to note, which I was going to say just here, is that all the rest is chosen such that the global minima of PPO and the global minima of DPO are, under some assumptions, essentially equivalent. So this is the right thing to do mathematically. I'm not going to go through the derivations, but that's the right thing to do. It's pretty different from PPO in the sense that with PPO what you had to do is collect the human preferences, then train a reward model with maximum likelihood, then use reinforcement learning. Now all you do is maximum likelihood. Much simpler. Yes? So it seems like this is (a) much simpler and (b) what you would just intuitively do, so why did they start with this reward model? What led them to do that? I think it's a great question. I don't really know. What I can tell you is that at OpenAI, the people who did ChatGPT initially are the ones who wrote PPO, and I think there were just a lot of reinforcement learning people, and for them it was very intuitive. There are also some additional potential benefits. For example, if you use the reward model, the cool thing with reinforcement learning is that you can use unlabeled data with the reward model. For DPO you can only use the labeled data. For PPO, you first train your reward model and then you can use unlabeled data, where the reward model will label this unlabeled data. So there could be potential improvements; in practice it turns out there are none. And I think it's just that a lot of people in this team were reinforcement learning
experts, including the main author of PPO, John Schulman. So DPO is much simpler than PPO and it performs as well. So now this is the standard thing that people use, at least in the open-source community; I believe it's the standard in industry too. So that's called DPO. Gains: those are all the papers on the left here. This is on a summarization task. All I want to show you is that the pretrained models were OK and they improve with scale; if you do supervised fine-tuning you improve them a little bit more; if you do PPO or something with RLHF, with human feedback, you get performance that is oftentimes, depending on the benchmark, even better than humans. This is the human reference summaries. Same thing: this is from a paper that we have, AlpacaFarm. The evaluation here is not too important, but you see the pretrained model, you jump to SFT, and then you jump to PPO and DPO, and PPO and DPO have the exact same performance. So RLHF helps, that's the conclusion, and DPO is simple. Data: the way that you collect that type of data. The first idea is to just use humans, as we already talked about. Guidelines for what humans should be labeling are very complicated, and it's really not that easy. If you ever do some of the labeling, you will see that it's extremely complicated. If I zoom in to this, here I have a question, "tell me about self-driving cars", and you read both answers: "self-driving cars are vehicles that are capable of detecting their surroundings, blah blah" and "self-driving cars are cars that are equipped with sensors, blah blah, to navigate without the need for a driver". Both seem OK. Which one is better? It's hard to say at a glance. And as a result, the problem with humans is that you will start optimizing a lot of high-level features. For example, the second one is longer; I can guarantee you that most humans will choose the second one, even though maybe the first one is better. I don't know, I haven't read it carefully. So, challenges with humans: first, slow and expensive. Second, as I just mentioned, it's hard to focus on
things that matter, like correctness, and people usually look at things that don't matter as much, like the form or the length. As a result, what I show here is that the more RLHF you do, the longer the outputs of the models become. So if you've ever been annoyed at ChatGPT answering you with super long sentences, this is because of RLHF. Annotator distribution shift: the distribution of annotators that you use matters a lot, and you have to think about who the humans even are that we want to represent in these models. Another question is crowdsourcing ethics. Usually, for a lot of the labeling that is done, the people who do it are not paid well, and they have to go through a lot of toxic data, because you want the model to avoid saying toxic things. So, crowdsourcing ethics too. So there are many challenges with human data. What we did, also last year, is again the same thing as Alpaca, just the idea of: well, there are challenges with humans, maybe we can just replace them with LLMs. So what we did is simply replace (oh, I'm just realizing that the slides are not centered) anyway, you replace human preferences with LLM preferences. Here on this figure, you see on the x-axis the price that we paid for collecting human data. It's around $300 for 1,000 examples, and this is on Mechanical Turk, whose workers are usually cheaper than some of the other companies that you could go through. On the y-axis it's the agreement with other humans, with the mode of other humans. What you see is that, as I told you before, labeling is really complicated: humans agree with themselves only around 66% of the time on a binary task. And it's not that the humans are not good here, because we were five main authors on this paper, we tried to label this data ourselves, and we only had, say, 67 or 68% accuracy, even though we talked for 3 hours about how we should be doing the labeling. Really, it's complicated; it's not an easy task. And here I just show many different models, and you see that models are
much cheaper, and they can get higher agreement with the mode of humans than humans themselves. The reason why is that humans have a lot of variance and models have no variance, so they might be a little bit more biased but they have less variance. So it works surprisingly well, and now it's the standard in the open-source community. I think even in industry a lot of people use both humans and LLMs for improving the collection of RLHF data. This is the paper from last year, but honestly now it's more that LLMs would be around this agreement and this cost, so I would say around 50x cheaper than humans and with better agreement with humans than humans themselves. OK, so that gets us to evaluation of post-training. That goes back to your initial question at the beginning of the lecture: how do you evaluate something like ChatGPT? The answers that ChatGPT could give are unbounded, and it's not that there's one answer; there are many answers that are just as good. So there are many challenges. One, you can't use validation loss, because one method might use PPO and the other one might use DPO, so validation loss is not comparable. Second, you can't use perplexity. That's the thing I told you before: these models are not calibrated, they don't give distributions, they just optimize for one thing. So you can't use perplexity for evaluating these types of models once they're aligned. Third, there's a large diversity of questions that humans might ask these models: generation, open QA, some question answering, some summarization, and all of these things. So there are so many things you have to cover. Then, the tasks are really open-ended, so it's very hard to automate. That's what you were alluding to before. So the idea is that instead of trying to come up with easily automated benchmarks, we're just going to ask questions that users ask these models in practice, and we're going to ask annotators to say, between these two models, which one is better, what's the better
output. So you do the exact same thing as for the data from RLHF, but you use it now for evaluation. Yes? I'm not sure I understand what you mean by "can't use perplexity" and "not calibrated"; the LLM is still doing next-token prediction. So think about it this way: the optimal solution after doing PPO is one model that gives you essentially a delta. It says that there's only one sentence that could be generated for that question. So now if you use it on something that is slightly semantically different, it would give a likelihood of zero for that answer. In reality it's not that extreme, because as you say it's still a distribution, but it just shows you that there's a fundamental issue with perplexity once these models are not LLMs anymore. They were not trained, at least with PPO, to do maximum likelihood anymore; they were trained to be policies. OK. So probably the most common, or the most trusted, benchmark is what we call Chatbot Arena, which is: go on the internet, have random users on the internet blindly talk with two chatbots, just ask many questions, see the two answers, and rate which one is better. You do that over hundreds of thousands of users, and then you get the actual preferences and you get rankings of models. So you can go now on Chatbot Arena and interact with these models. One potential issue, just to highlight, is that the people who want to do this type of thing are usually more tech-driven or tech-savvy, so a lot of the questions that get asked are more tech stuff: discussing software errors, inquiries about AI tools, and all these things. Another issue is cost and speed. If you really want to use something like this in a development process, it will be too costly, because you would need to pay a lot of humans to do that. So one simple idea is, again, as we said many times, just use LLMs instead of humans. You probably know the drill at this point. Steps: for every instruction, generate outputs by some baseline
and the model that you want to evaluate. So here you imagine that I'm comparing an answer from ChatGPT and from another model. I'm just asking another model which one is better, and I average that out. Yeah, I ask GPT-4 which one is better, I average that out over my entire distribution, over my entire benchmark or dataset, and that gives me a win rate, so a win probability for one model compared to another one. And now you can rank models; this is the AlpacaEval leaderboard. The benefit of this is that we get 98% correlation with Chatbot Arena, so very high correlation with humans. This is the comparison, the correlation with other benchmarks. And it takes less than three minutes and less than $10 to run, so it's pretty cheap. There are downsides, though. One of them is spurious correlation. As we already saw before, LLMs prefer (this is one spurious correlation of many, I'll just talk about one) LLMs prefer longer outputs. Humans also prefer longer outputs, but the issue once you use LLMs is that once there's a bias, you will continue optimizing for it. Humans, at some point, I can guarantee you, if I ask a simple question and you give me five pages of answers, I'll be like, no, I don't like that answer. But LLMs, if they have this bias and they were trained for that, will continue preferring longer outputs. So here we see the preference, just showing that humans and models prefer longer outputs. And here is another view of the initial AlpacaEval benchmark, where we look at the win rate of GPT-4 versus GPT-4 itself. If we use the standard GPT-4, it gets 50% by definition, because we're comparing GPT-4 versus GPT-4. But if we ask GPT-4 to be slightly more verbose, so we just say in the prompt "be verbose in your answers", then it gets a win rate of 64.4%. So really there's a huge variance. And if we ask it to be concise, it gets 20%. So there's a huge variance depending on whether you ask it to be concise or verbose. That's very annoying. So one possible solution, which is what
we did, is just use some regression analysis. I'm not going to go into details, but you use causal inference tools to control for length, and now length matters much less. If you ask it to be verbose, we still get some gains, but much less. Great. So that's all about post-training, and now for the next eight minutes I might talk about systems or just answer questions. Yes? Can you go back to post-training? In terms of post-training, how did we tune those parameters using the small body of fine-tuning data and have such a big effect on the model? You mentioned earlier that there's a different set of hyperparameters. Are we changing just some of the weights, the later weights, or all the weights? What's happening? Yeah, I skimmed through all of this. You change all the weights. Industry would change all the weights. In open-source land you might have heard of LoRA, which is going to change only some of the weights, or to be more specific, it's going to add some differences to the output of every layer. But in industry you're going to just fine-tune all the weights. And also, to say something else about the data: in this last step, RLHF, you're usually going to collect a lot more data than with SFT. So if SFT is 5,000, 10,000, maybe 50,000, with RLHF I think you're going to be more around the 1 million order of magnitude. It's still much less than pretraining, though. Yeah, because pretraining is 15 trillion tokens; this is not even a drop, and yet you influence the weights a lot. You have to think about how you do it: as I said, the learning rate that you're going to use is going to be different, but also, just imagine if I train, even on one sentence, but over and over again. At some point my model will only output that sentence, even if it was just one sentence instead of the 15 trillion tokens. So if you use a large enough learning rate and for enough time, you will overfit to that sentence. So the key thing to remember is that the data is not, I mean,
it's not as if you mix some post-training data and some pretraining data. You do pretraining, and then you just start fine-tuning only on the post-training data. So another perspective is that the pretraining is just the initialization of your model. Once you view it that way, that this is just an initialization of the weights, then there's nothing special. You don't need to remember that you trained on a lot of data before. The only thing that matters is that you had an initialization and now you train a model. So maybe think about it that way: there's a Markov property in some way. You had your weights, this is my initialization, now I'm training from that one. Does that answer your question? But you said something just now about it being almost the equivalent of just rerunning the fine-tuning data many times. Is that what happens in order to give it so much more preference? I don't know right now how they do it in industry. When we did Alpaca, we had to do three epochs, so you did run through it three times. But even the number of times that you run through it is not what's important; the only thing that matters is the effective learning rate. So, yeah. Great. So I think I have five minutes. OK, I might try to give a high-level overview of at least one of the systems tricks. Systems: as we said, for everyone the bottleneck is, sorry, compute is the huge bottleneck. One question you might ask is: why not buy more GPUs? GPUs are expensive, but they're also scarce; even if you have $10 million right now, you cannot buy the best GPUs. There are also some physical limitations: when you have multiple GPUs, you have to communicate between them, and that takes time. So just buying more GPUs is not that easy. So it's really important to think about how you allocate resources and how you optimize your pipeline. So, systems 101 on GPUs. I'm sorry, I'm going slightly faster; I hope that at least some of you can follow. GPUs are optimized for
throughput; CPUs are optimized for latency. So for GPUs, the way you have to think about it is that there's one command that is run on many cores at the same time on different pieces of data. This is how you see a GPU: you see there are many different cores (we call them streaming multiprocessors), which is very different from the usual CPU architecture. So just think high-throughput parallelization for GPUs. GPUs are optimized for fast matrix multiplication, so every time you do something on a GPU, if you can do it with a matrix multiplication, it's going to be 10 times faster than with anything else. That is a little bit annoying, because it means that we're bottlenecked to doing everything with matrix multiplications. Another thing to note with GPUs is that compute has been improving faster than memory and communication. So now it's hard for the data that you send to GPUs to keep up with the processors, and most of your GPUs are going to be idle if you just run normal code, if you don't optimize your code for communication. And this will continue over time. Another thing to know about GPUs is that there's a memory hierarchy. This is the same thing with CPUs: the closer you are to your cores, the less memory there is but the faster things run; if you're further away, there's more memory but it's slower. OK, I'm going to skip that. OK, I'm going to say it; I told you about this fact about communication. The metric that people usually look at is model FLOP utilization: what is the theoretical maximum that the GPU could run at, the number of FLOPs that you could use per second, and then you take, sorry, the observed throughput divided by this theoretical maximum. In general, if you reach 50% you're very happy. Facebook, I looked at Llama, was at 45% or something like this. So that means that data doesn't come fast enough even for these big companies. So one simple trick, and that might be the only one I'm going to tell you about, is low precision. One simple idea is that, well, if
I put my floats in lower precision, then there are fewer bits that I have to send to my GPUs. If there are fewer bits, it's faster communication and lower memory consumption; things are going to go faster. And for deep learning it just happens that the decimal precision is not that important. When you do matrix multiplication, when you do for example SGD, there's already so much noise that if you update something by 0.01 or 0.015, who cares? So instead of using 32 bits per float, which is what people used to use, or 64, which is what you would use in other domains, you use 16 bits for matrix multiplication. So for every float you use 16 bits. And for training you have this type of what we call automatic mixed precision, which is that some of the things are in 32 bits and others are in 16 bits. Generally, the way you should be thinking about it is that the weights of your model are stored in 32 bits, but just before the computation you put everything in 16 bits. This way you do the computation super fast, and at the end you update your weights in 32 bits. The reason why you do all the updates in 32 bits is that if your learning rate, for example, is very small, you still want to be able to make a difference in your weights. So all the computation is done in 16 bits but the weights are stored in 32 bits. That's the standard way that people are doing it. OK, I'll talk just about this and then I'll skip all the rest: operator fusion, because I think this is pretty cool. As I just said, communication is very slow, and every time you use a PyTorch line, it moves variables to the global memory of your GPU. So when you have something like x1 = x.cos() and then you do x1.cos(), what is happening behind the scenes is that you take the x, which is data, you ship it to the actual processors of your GPU, you apply the cosine, you ship it back to the main memory of your GPU, and then you see the next line, you ship it back to the GPU
processor, you apply another cosine, and you ship it back again. Another way to see that is that you go from your DRAM, which is the global memory of your GPU, and you ship it to compute, and you ship it back, for every line. This is the naive way of doing it, and it seems very wasteful. So the simple idea of operator fusion is: communicate once, do all the computation, ship it back once. And this is exactly what fused kernels are. So if you ever want to make your computations in PyTorch much faster, just apply torch.
compile on your model. This is going to make your model around two times faster, and what it does is simply that it rewrites your PyTorch code in C++, in CUDA, to do the communication only once, then do all the operations, then ship it back. OK, I'm not going to have time to talk about the rest: tiling is important, parallelization is important, and mixture of experts is important. Outlook: there are many things we haven't talked about. We haven't talked about architectures, we definitely haven't talked about inference. There are many other things that are important with LLMs. What is the UI that you use? Arguably, with ChatGPT the big novelty was just having a simple UI to use it. Multimodality. What are all the misuses you could have? The fact that there might not be enough data on the internet to train all these models. Legality of data collection. So many other things. If you are interested in all these topics, I would suggest three classes. CS224N is probably the one that touches the least on LLMs, but it gives some background and historical context for all the LLMs and gives some adjacent material. CS324, I think it's just called Large Language Models, has more in-depth reading and lectures on everything I talked about. CS336, which is Large Language Models from Scratch: you build your own LLM. It's an amazing class, also given by my two supervisors. Very heavy workload, so be careful. Great.
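For readers who want the two preference losses from the post-training part of the lecture in concrete form, here is a minimal sketch in plain Python. It is not the lecturer's code. The first function is the Bradley-Terry loss used to fit a reward model on (preferred, rejected) pairs; the second is the standard published form of the DPO loss, which has the same shape but replaces the reward with a scaled log-probability ratio between the model being trained and a frozen reference model. The function names, the `beta` value of 0.1, and all numbers in the demo are assumptions chosen for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-probability that the preferred output wins.

    p(preferred wins) = exp(r_w) / (exp(r_w) + exp(r_l)) = sigmoid(r_w - r_l),
    i.e. a softmax over the two rewards, with the rewards acting as logits.
    """
    return -log_sigmoid(reward_preferred - reward_rejected)


def dpo_loss(
    logp_preferred: float,      # log-prob of the preferred answer under the policy
    logp_rejected: float,       # log-prob of the rejected answer under the policy
    ref_logp_preferred: float,  # same answers scored by the frozen reference model
    ref_logp_rejected: float,
    beta: float = 0.1,          # hypothetical regularization strength
) -> float:
    """DPO loss for a single preference pair.

    The implicit reward of an answer is beta * (policy log-prob - reference log-prob);
    the loss is the Bradley-Terry loss on those implicit rewards.
    """
    preferred_margin = logp_preferred - ref_logp_preferred
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (preferred_margin - rejected_margin))


# Made-up numbers, purely for illustration.
print("reward model, no gap:   ", bradley_terry_loss(0.0, 0.0))
print("reward model, large gap:", bradley_terry_loss(3.0, 0.0))
print("DPO, policy == reference:", dpo_loss(-10.0, -12.0, -10.0, -12.0))
print("DPO, policy moved toward preferred:", dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

In both cases the loss goes down as the preferred output is scored further above the rejected one, which is the "maximize green, minimize red" intuition. Real implementations sum token log-probabilities over each full answer and average the loss over a batch of pairs.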