How LLMs Learn
Surprisingly, many people are still under the impression that AI models are programmed and designed by humans. This was true for older “symbolic” AI systems, but it is not the case for the majority of the LLMs you use. I want to clear this up early in the course because a lack of fundamental understanding of how these models are made leads to poor usage. It can lead people to underestimate AI capability in some areas (and therefore not push AI usage to its potential) and to overestimate it in others (which can lead to user frustration and premature rejection of these tools, or worse, to AI mistakes that slip into work outputs).
Most explanations of neural networks, especially the transformer architecture that powers LLMs, get lost in a swamp of mathematical proofs, obscure training processes, complex architecture charts, and dense code. That’s all well and good if your dream is to train models from scratch or invent new architectures. But if your goal is to begin building with and using LLMs effectively, practical know-how matters far more than the finer details of matrix calculus. We believe a thorough understanding of the fundamental design of LLMs is no longer strictly necessary to use them and build LLM products effectively.
That said, understanding some theoretical concepts, like an LLM’s training objective, its training data, how it generates words, and embeddings, can make a big difference in getting the most out of these models. The aim of this article is to give you the big picture, so you can fit new concepts into place as you go. By the end, you won’t feel like you’ve mastered everything, but that’s okay! The key concepts will be revisited later in our course to make sure you get them. Other concepts and less relevant jargon mentioned here will just quietly exit the stage after this lesson, never to be seen again.
How LLMs Fit Into the Broader Field of AI
LLMs didn’t appear in a vacuum; they’re part of a broader attempt to make machines do things that, until recently, only humans could do. Artificial intelligence (AI) is the grand umbrella term for all of this: the quest to make computers recognize patterns, understand language, and make decisions. Inside that umbrella, machine learning is the field that lets computers learn from data instead of following explicit rules. Earlier AI systems, which we call expert systems, were full of hand-coded logic: a bunch of if statements stacked together, defined by a domain expert to explicitly tell the system what to do in which case. Machine learning changed the game: we give models lots of examples and only a few hard-coded rules, and let them figure the rest out themselves.
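To make the contrast concrete, here is a toy sketch of the old, expert-system style of AI: a domain expert hard-codes every rule by hand, and anything the rules don’t cover falls through the cracks. (This is purely illustrative; real expert systems contained thousands of such rules.)

```python
# A toy "expert system" for sentiment: every rule is written by hand.
def rule_based_sentiment(text: str) -> str:
    text = text.lower()
    if "love" in text or "great" in text:
        return "positive"
    if "hate" in text or "terrible" in text:
        return "negative"
    return "unknown"  # no rule matches, so the system simply gives up

print(rule_based_sentiment("I love this course"))    # positive
print(rule_based_sentiment("A subtle masterpiece"))  # unknown: nobody wrote a rule for this
```

A machine learning model, by contrast, would infer what “positive” looks like from thousands of labeled examples rather than from rules someone typed in.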
Within machine learning, we have neural networks, originally inspired by the human brain (which is to say, inspired by a vague cartoon of how we thought the brain works). These networks are just layers of tiny mathematical functions passing numbers around, adjusting connections until they can recognize patterns. When we stack a lot of these layers together, we get deep learning, which is where LLMs sit.
Then there’s reinforcement learning, a method where AI learns by trial and error, getting rewarded for good behavior and correct guesses. Reinforcement learning techniques are now often combined with deep learning to optimize the latest LLMs, allowing us to further train these language models directly from human feedback rather than from large pre-built datasets.
Traditionally, making a machine understand language was the job of **natural language processing (NLP),** a field with specialized tools for tasks like translation, sentiment analysis, and text classification. LLMs have now steamrolled through NLP, taking over almost everything — from writing essays to generating code. They’ve also expanded into new territories like image processing and audio generation. If you had asked a researcher ten years ago whether a language model would be explaining quantum mechanics, writing poems, and generating Python scripts, they’d have laughed. And yet, here we are.
Now, let’s get to the crux of LLMs.
At their core, LLMs are sophisticated artificial neural networks. They are not magic, not conscious, and not yet plotting world domination; they are just very sophisticated and complex functions that predict the next word in a sequence. The human brain has a vast network of interconnected neurons that communicate through electrical and chemical signals across synapses. Similarly, artificial neural networks comprise a network of nodes (or “neurons”) performing simple mathematical operations, connected by parameters (weights and biases), which are numbers stored in matrices. These parameters are the elements that are adjusted and learned during model training. If this feels like a lot of information at once, don’t worry; by the end of our course, words like parameters and weights will be part of your casual conversations, while the less relevant jargon can safely be forgotten.
Think of an artificial neural network as a huge system of mini-computers organized in a grid, all connected, adjusting and refining numbers as data flows through. In artificial neural networks, parameters are stored in matrices, and the nodes are organized into layers. Storing model parameters in matrices allows many calculations to be made in parallel (when I talk about calculations here, I mostly mean matrix multiplications). Each layer processes information by doing those multiplications and passes it to the next layer, refining raw data into meaningful output through a series of transformations, just like a basic function such as y = 8x + 2 refines any x value into a “meaningful” y value.
Each connection between nodes has a weight. A weight is one of the numbers inside these matrices, and it determines how much influence one part of the network has on another. During training, these weights are tweaked millions (or billions) of times until the model gets very, very good at predicting text.
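As a rough sketch of what one of those layers actually does, assuming nothing more than NumPy and toy sizes (real LLM layers have thousands of dimensions, and there are billions of weights overall):

```python
import numpy as np

# One neural-network layer: multiply the inputs by a weight matrix,
# add biases, apply a simple non-linearity, pass the result onward.
x = np.array([0.5, -1.2, 3.0])    # incoming values (3 of them)
W = np.random.randn(4, 3) * 0.1   # weights: 4 outputs x 3 inputs, tweaked during training
b = np.zeros(4)                   # biases, also learned

y = np.maximum(0, W @ x + b)      # matrix multiplication, then ReLU
print(y)                          # refined output, handed to the next layer
```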
The “deep” in deep learning just means these networks have many layers, not that they have profound thoughts. The key to LLMs’ power lies in their transformer architecture, which lets them process vast amounts of text efficiently in parallel. This transformer architecture is very data-hungry, so for such a thing to start producing eerily human-like responses, it needs an absurd amount of training data. This data (text, images, conversations) is gathered from books, websites, everyday conversations, and anything else the engineers can get their hands on. These datasets add up to terabytes of information. But raw text and image data isn’t useful until it’s transformed into something the model can process.
That’s where tokenization comes in. The model doesn’t read words; it reads tokens: the smallest units of data the model can work with, which might be words, parts of words, or even whole phrases. The process begins with a tokenizer transforming the raw text or data into these tokens. In the case of images or audio, tokens can be portions of those files, breaking larger inputs into manageable pieces.
For example, the sentence “The cat sat on the mat” might be split into the following tokens: “The,” “cat,” “sat,” “on,” “the,” “mat.” Tokens are not always single or whole words, however; they can be parts of words or groups of words. The token vocabulary is decided based on your training data and tries to represent its word distribution as efficiently as possible, grouping frequently occurring character sequences like “ing” into single tokens and keeping common words like “Hello” as they are.
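If you want to see real tokenization in action, here is a quick sketch assuming you have OpenAI’s open-source tiktoken library installed (pip install tiktoken); the exact token boundaries and ids depend entirely on which tokenizer you load:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # tokenizer used by several recent OpenAI models
token_ids = enc.encode("The cat sat on the mat")
print(token_ids)                              # a list of integer ids
print([enc.decode([t]) for t in token_ids])   # the text piece behind each id
```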
Once the text is tokenized, the next step is to feed these tokens into the neural network in a way that it can process. Since the model works with numbers, tokens need to be converted into numerical representations that capture the meaning and relationships between them. This is where vectors and what we call embeddings come into play.
A vector is simply a list of numbers that represents a token in a multi-dimensional space, and an embedding is exactly such a vector. These embeddings are essential for capturing the relationships between different tokens. The model doesn’t just need to know that a token represents the word “cat”; it needs to know how “cat” relates to other words like “dog” or “kitten.”
While it’s hard to visualize a space with thousands of dimensions, you can think of it as a complex coordinate system where each token is positioned somewhere. Picture it in three dimensions; the concept is the same when you have thousands of them. The key idea is that the position of each token’s embedding relative to others indicates how similar or related they are in meaning. For example, “cat” and “kitten” are close together, while “cat” and “garden” are far apart. And “kitten” and “puppy” might be closer to each other than “kitten” is to “dog,” because they both mean “baby animal.”
The term embedding instead of vector is used because the model creates a representation of each token that “embeds” it in a specific place in the multi-dimensional space. These embeddings are learned by the model during training.
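Here is a toy illustration of that idea with made-up three-dimensional vectors (real embeddings are learned and have hundreds or thousands of dimensions), using cosine similarity to measure how close two tokens sit:

```python
import numpy as np

# Made-up toy embeddings, purely for illustration.
embeddings = {
    "cat":    np.array([0.90, 0.80, 0.10]),
    "kitten": np.array([0.85, 0.90, 0.15]),
    "garden": np.array([0.10, 0.20, 0.95]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # close to 1: related meanings
print(cosine_similarity(embeddings["cat"], embeddings["garden"]))  # much lower: unrelated
```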
Now that we have turned our text into something the neural network can understand, we can finally feed it in and start generating new words!
LLMs wouldn’t be possible without transformers, a specific type of neural network architecture that revolutionized machine learning in 2017. Their secret weapon is, for the most part, the attention mechanism, which lets models focus on the most relevant words in a sentence.
The details of the architecture are not generally important when you’re building apps that call, extend, or fine-tune LLMs, so I won’t explain them in detail. However, it does help to understand the basics of the attention mechanism, which was one of the key innovations in the architecture.
Consider the sentence “He sat on the bank and watched the river flow.” If we analyze one word at a time, the word “bank” could mean a place where money is stored or the edge of a river. The attention mechanism in transformers works by examining all the words in the sentence at once, weighing their relationships to one another. In this case, tokens like “river” and “flow” naturally draw more weight, guiding the model to interpret “bank” as the riverbank rather than a financial institution. This process, where every token’s influence is dynamically evaluated, enables the model to focus on the most contextually relevant information and resolve ambiguity effectively. By capturing these contextual relationships and long-range dependencies, attention layers understand the nuances of language and process it in a way that preserves meaning across a sequence of tokens.
A key breakthrough of the attention mechanism is that it allowed all input tokens (say 500 words) to be processed within a single matrix in parallel in each layer — dramatically reducing the computational cost of training large models and making them well-suited to GPUs. This paved the way for much larger models with more training data and capability.
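For the curious, here is a bare-bones sketch of that idea: scaled dot-product attention over a whole (random) sequence at once. Real transformers add learned projections, multiple attention heads, and many stacked layers on top of this.

```python
import numpy as np

seq_len, d = 5, 8                 # 5 tokens, each represented by an 8-dimensional vector
Q = np.random.randn(seq_len, d)   # queries
K = np.random.randn(seq_len, d)   # keys
V = np.random.randn(seq_len, d)   # values

scores = Q @ K.T / np.sqrt(d)     # how strongly each token attends to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row
output = weights @ V              # every token's new, context-aware representation, in parallel
print(output.shape)               # (5, 8)
```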
There are further innovations in most of the latest LLMs, such as the Mixture of Experts (MoE) architecture. MoE improves the cost and speed of running production LLMs (a process called inference). In an MoE model, different layers contain multiple “expert” sub-models; these experts are subsets of the model’s nodes and parameters. This allows the model to scale effectively by activating only a subset of experts (and consequently a subset of nodes and parameters) for each input, reducing the number of calculations that need to be performed to run the model. Only some of its “mini-computers” are switched on, so it requires less computation to operate. MoEs can leverage specialized knowledge within each expert, enabling the model to handle a diverse range of tasks more efficiently than a monolithic (dense) network that applies the same computations to all inputs. You can think of this as a diverse team of human specialists where only the relevant experts are consulted for a specific problem, rather than everyone’s time being wasted in an all-hands meeting for every issue (hopefully not too often, at least!). As a developer, that means lower inference cost and higher throughput for your product.
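A toy sketch of the routing idea, with a made-up “router” that scores four tiny experts and only runs the top two (real MoE layers use learned routers, far larger experts, and extra tricks for load balancing):

```python
import numpy as np

num_experts, d, top_k = 4, 8, 2
experts = [np.random.randn(d, d) * 0.1 for _ in range(num_experts)]  # each expert: a small layer
router = np.random.randn(num_experts, d) * 0.1                       # scores experts per input

def moe_layer(x):
    scores = router @ x                            # one score per expert
    chosen = np.argsort(scores)[-top_k:]           # keep only the best-matching experts
    gates = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    # only the chosen experts do any work; the rest stay switched off
    return sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))

print(moe_layer(np.random.randn(d)).shape)         # (8,)
```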
💡 See the Transformer Explainer webpage if you wish to confuse yourself further with a graphical representation of the transformer architecture.
Now that we’ve seen how these models are structured, let’s dive into the training process, starting with the first step, which we call pre-training.
Unless you plan to train foundation models from scratch, the heavy math (softmax, back-prop, gradient descent) is background; we’ll stay focused on the parts you’ll touch while building LLM-enabled products.
During training, the model engages in a massive game of prediction. You show it part of a sequence (sentences in the training data from web pages, books, etc), and it has to guess what comes next. It starts out clueless, throwing out wild guesses, like a toddler babbling nonsense. But every time it gets something wrong, you slap it on the wrist (mathematically speaking), and you nod approvingly every time it gets something right. This goes on billions of times until the model gets scarily good at making predictions.
Future tokens from the input sequence are “masked,” or hidden from the model, essentially to avoid cheating in this prediction game!
For example, if it sees:
“The cat is on the”,
it might say “mat”, or maybe “table”, or if it’s feeling adventurous, “roof” (cats are unpredictable, after all). The key is that it doesn’t just memorize sentences — it picks up on patterns. It learns that “cat” is more likely to be on a “mat” than, say, a “tax return.” Well, if you put your tax return on the ground, the cat might directly go sit there, but you get the point.
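To make the game concrete, here is a toy view of the training pairs built from a single sentence: each prefix is the context, the following token is the answer, and everything after it stays masked during that prediction.

```python
tokens = ["The", "cat", "is", "on", "the", "mat"]
for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(context, "->", target)
# ['The'] -> cat
# ['The', 'cat'] -> is
# ... and so on, billions of times, across the whole training set
```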
But how does it predict the right word?
The model isn’t just making a wild guess; it’s running the sentence through our layers of tiny interconnected mathematical functions, crunching numbers, and refining probabilities at each step. Instead of saying, “I’m sure the next word is ‘mat’”, it produces a list of options with different probabilities:
- Mat: 55%
- Table: 10%
- Roof: 4%
- Chainsaw: 0.1% (something clearly went wrong here)
…and so on.
At the very end, this probability soup comes from a function called softmax, which turns the model’s raw scores into probabilities so it can decide on one answer while keeping other possibilities in mind. This is also where the Temperature parameter you may have seen or experimented with comes in: it modifies the softmax step to add or remove randomness in these probabilities.
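Here is a minimal sketch of softmax with a temperature knob, using made-up raw scores (logits) for our cat sentence. Lower temperatures sharpen the distribution toward “mat,” while higher ones flatten it and make adventurous picks like “roof” more likely.

```python
import numpy as np

logits = {"mat": 4.0, "table": 2.3, "roof": 1.4, "chainsaw": -2.0}  # made-up raw scores

def softmax_with_temperature(logits, temperature=1.0):
    scores = np.array(list(logits.values())) / temperature
    probs = np.exp(scores - scores.max())      # subtract the max for numerical stability
    probs /= probs.sum()
    return dict(zip(logits, probs.round(3)))

print(softmax_with_temperature(logits, temperature=1.0))  # "mat" dominates
print(softmax_with_temperature(logits, temperature=2.0))  # flatter: more randomness in sampling
```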
Now, this guessing game alone wouldn’t be very useful if the model didn’t learn from its mistakes. That’s where backpropagation comes in. If the model confidently says “chainsaw” instead of “mat,” we need to correct it. Backpropagation is like rewinding the decision-making process and figuring out which parts of the network led to that ridiculous answer. The worst-offending parameters — the ones that nudged the model toward “chainsaw” — get slightly adjusted so it’s less likely to make the same mistake next time.
To make sure this whole process doesn’t spiral out of control, we use gradient descent, a method that helps the model adjust just the right amount — like nudging a steering wheel instead of yanking it hard left. Train too cautiously, and it takes forever to learn. Train too aggressively, and it overreacts, making it unstable. It’s all about finding the sweet spot.
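At its smallest scale, one gradient-descent update is just this (a single made-up parameter and gradient; real training applies the same nudge to billions of parameters at once):

```python
weight = 0.80          # current value of one parameter
gradient = 2.5         # how much (and in which direction) it contributed to the error
learning_rate = 0.01   # the size of the nudge on the steering wheel

weight = weight - learning_rate * gradient
print(weight)          # 0.775: adjusted slightly, not yanked
```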
And, of course, there are plenty of ways things can go wrong. The model might overfit, meaning it memorizes the training data too well and gets flustered when faced with new inputs. Or its numbers might explode through successive calculations, growing into unmanageable values that wreck everything. In practice, training LLMs effectively requires considerable engineering and several nuanced techniques to address these issues and ensure optimal performance.
In short, training an LLM is a massive numbers game where the model plays fill-in-the-blanks over and over again, learning from its mistakes and fine-tuning itself along the way.
While the model is trained on huge amounts of text data, it doesn’t simply store and memorize this information. Instead, it learns to navigate and contextualize the training data, recognizing patterns and relationships that form the structure of our languages and shared human knowledge.
You could think of the LLM training process as compressing enough information about the features, patterns, and relationships found in (and between) the data in its training set to be able to reproduce a “lossy replica” of as much of that training data as possible. For example, LLMs may produce outputs that seem coherent but contain factual inaccuracies or hallucinations as they try to guess at the blanks in their knowledge. This is similar to how a lossy image file might look clear but lose pixel-level detail; depending on the resolution, some key details of the scene may get confused.
Another way to understand this training process is that it basically simulates whatever you train the model on. What I mean is, if you train the LLM on the internet, you essentially train it to become a kind of simulation of the internet: give it part of a Wikipedia page, and it will simply try to predict the rest of that exact page.
Predicting the next word (or token) in a sequence can require developing a surprisingly complex understanding of grammar, the interrelationships between different parts of the text, and entire fields of human knowledge. For example, when training on a short crime novel, a lot of complexity is needed to accurately predict the next word of the sentence that reveals who the murderer was at the end of the story.
Remember, all the useful information an LLM learns during training is stored in its model parameters. The model has to learn patterns and features in its data rather than just memorizing facts, partly because it has a very constrained memory budget. For example, Meta’s popular Llama 3 70B is trained on roughly 15 terabytes (TB) of text to produce a model weights file of just 140 gigabytes (GB), which means the information is compressed around 100 times. These LLMs cannot memorize all their training data in their model weights, and that is not what they are designed to do. The model needs to develop a much more robust ability to “understand,” search for, use, and combine the features learned from its training set than you could get from simply memorizing everything in a database.
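The back-of-the-envelope arithmetic behind that “around 100 times” figure, assuming the weights are stored as 16-bit (2-byte) numbers:

```python
params = 70e9                       # Llama 3 70B: roughly 70 billion parameters
weights_gb = params * 2 / 1e9       # ~140 GB of weights at 2 bytes each
training_data_gb = 15_000           # ~15 TB of training text
print(weights_gb)                   # 140.0
print(training_data_gb / weights_gb)  # ~107x: the data simply cannot fit verbatim
```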
💡And, by the way, it is key to remember this when you work with LLMs! You can’t assume it knows things just because they are on the internet and it was trained on them.
On some occasions, the model does, in fact, memorize certain facts and passages of specific text more directly — but this word-for-word memorization is generally only for commonly repeated information in its training data, such as popular open-source books or expressions that appear particularly often on the internet.
💡This occasional full “memorization” can be beneficial, but it can also be an issue when using these models in practice; you may need to build safeguards to avoid generating copyrighted material or harmful expressions. AI labs normally build these safeguards into their user-facing chatbots and applications, but they are not perfect.
Once pre-training is done, the “Post-Training” process begins.
So, you’ve got this big, fancy language model. You’ve spent months feeding it mountains of text, letting it play its giant word-guessing game, tweaking its internal dials through trial and error. Great. But here’s the problem: it still doesn’t know how to talk to real people properly. You are still in the “internet simulation” state.
After the initial pre-training phase, LLMs need to undergo further refinement through what we call instruction tuning and reinforcement learning from human feedback (RLHF) to enhance their performance and adaptability. This means we have to train the model again… twice!
See, all that text it was trained on? It wasn’t neatly labeled with questions and answers, or clear-cut instructions. The model saw books, articles, and random internet debates, but it wasn’t trained to respond to things the way a human would expect in a conversation. It was trained to predict the next word that would make sense. So, if you ask it something specific, it might just spit out a list of vaguely related questions rather than answering you.
To fix this, we do something called instruction tuning or supervised fine-tuning (SFT). They are the same thing. Instead of just feeding it raw text, we give it a carefully crafted dataset full of instructions and proper responses. Think of it as teaching the model how to actually follow directions instead of just rambling about everything it knows. This extra tuning makes it much better at responding to commands, answering questions properly, and generally acting more like a useful assistant rather than an internet word blender.
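To give a feel for what that “carefully crafted dataset” looks like, here is a hypothetical, heavily simplified example; field names and formats vary between labs and datasets:

```python
# Hypothetical instruction-tuning (SFT) examples: an instruction paired with
# the kind of response we want the model to learn to produce.
sft_examples = [
    {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "Large language models are trained to predict the next token...",
        "response": "LLMs learn language patterns by repeatedly predicting the next token.",
    },
    {
        "instruction": "Translate 'good morning' into French.",
        "input": "",
        "response": "Bonjour.",
    },
]
```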
But we’re not done yet. Even after instruction tuning, the model still isn’t perfect. It might miss the point of certain questions, misunderstand the tone, provide dangerous information, or just go off on weird tangents. That’s where Reinforcement Learning from Human Feedback (RLHF) comes in. Humans rank a bunch of different AI-generated responses, and this feedback is used to align the model’s answers with human preferences (or with the specific policies of the AI lab creating it).
Now, you might imagine that this means humans are correcting the model live as it generates responses — but that’s not really how it works. The human feedback isn’t fed directly into the model — it’s used to train a separate system called a reward model, which then learns how to judge responses the way a human would.
Once this reward model is trained, it’s used to fine-tune and adapt the original LLM automatically. The LLM generates responses, the reward model scores them, and the LLM gradually learns to produce answers that rank higher according to the reward model, which ideally means answers humans would prefer. So instead of tweaking every response manually, humans train a grader, and that grader is what actually guides the model’s learning during training.
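A heavily simplified sketch of that loop, with both models replaced by stand-in functions (a real pipeline uses a trained reward model and a reinforcement-learning algorithm such as PPO on top of this idea):

```python
def llm_generate(prompt, n=4):
    # Stand-in for the LLM proposing several candidate answers.
    return [f"candidate answer {i} to: {prompt}" for i in range(n)]

def reward_model(prompt, answer):
    # Stand-in score; a real reward model is trained on human rankings.
    return len(answer)

prompt = "Explain what a reward model is."
candidates = llm_generate(prompt)
best = max(candidates, key=lambda a: reward_model(prompt, a))
print(best)   # the kind of answer the LLM gets nudged toward producing more often
```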
Now, why not just have humans write out perfect answers every time? Well, because writing flawless responses for every possible input would take forever. It’s way easier to have humans rank a few options than to have them come up with the ideal answer from scratch. So, by building a reward model from human preferences, we make the LLM more useful, more aligned with what people actually want, and less likely to go completely off the rails.
Of course, there are still problems. These models can absorb biases from their training data (or from human feedback), make things up, or say things they shouldn’t. Companies are constantly working on ways to keep them in check — adjusting how they respond to sensitive topics, filtering out misinformation, and making sure they don’t accidentally encourage bad behavior. But no matter how much fine-tuning we do, LLMs are not perfect. They’re just very, very good at guessing what sounds reasonable.
So, the takeaway? The pre-training phase makes the model smart, but instruction tuning and RLHF make it actually usable for the apps you’ll build. Without these extra steps, the model is just a giant text predictor with no sense of what humans actually want. With them, it starts acting like an AI assistant that actually feels like it understands you. (Well, mostly.)
💡After a new model has finished training, its official release is often delayed many months to allow for internal and external safety testing. Further Post-Training may be required to fix shortcomings prior to release.
And, finally, the process of actually using an LLM that has finished training is called Inference. In the context of LLMs, inference involves generating responses to questions or “prompts” by applying the learned patterns and knowledge from the training phase.
Now, here’s the catch: inference has to be fast and efficient, especially when you’re dealing with real-world applications. Nobody wants to wait ten minutes for an AI-generated email reply or a chatbot response when googling takes a few seconds. And remember, these models are gigantic, with billions of parameters, meaning every time they generate a response, they’re crunching an enormous amount of numbers. If you ran them exactly the same way they were trained, they’d be painfully slow — not to mention ridiculously expensive.
That’s why AI labs use all sorts of clever tricks to speed things up. Some methods shrink the model down (via techniques like quantization and more), making it smaller and easier to run. Others optimize how computations are done, so you get the same output with fewer steps.
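As one concrete example of “shrinking the model down,” here is a toy sketch of quantization: storing weights as 8-bit integers plus a scale factor instead of 32-bit floats. Real schemes are more sophisticated, with per-channel scales and careful calibration.

```python
import numpy as np

weights = np.random.randn(4).astype(np.float32)
scale = np.abs(weights).max() / 127                       # map the largest weight to 127
quantized = np.round(weights / scale).astype(np.int8)     # 4x smaller to store
restored = quantized.astype(np.float32) * scale           # approximate originals at inference time
print(weights)
print(restored)    # close, but not identical: a small accuracy trade-off for speed and size
```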
In short: pre-training makes the model smart, post-training makes it usable, and inference optimizations are what make it economically viable. All these steps play a crucial role. Making inference fast and efficient is one of the biggest challenges in deploying the smartest LLMs in the real world, and you’ll learn several optimization techniques if you decide to take our AI for Business Professional course.
I hope you found this overview of how LLMs process language, and how they transform vast amounts of data into human-like responses, useful or at least interesting. Just remember: at their core, LLMs are giant statistical engines that predict the next word. Your job as a developer is to harness that prediction engine in safe, reliable, creative ways. They’re remarkable tools, but still tools.
Thank you for reading!