Why RAG Is Not Training Your AI
Are You Actually Changing the Model, or Just Giving It Context?
“If I paste my company data into ChatGPT, did I just train it?”
I keep getting versions of this same question, like, “Can I inject knowledge directly into the model?” or, phrased more technically, “Is adding data through RAG the same as modifying the latent space?” But underneath all of them is the same confusion.
Are we actually changing the model’s brain, or are we just giving it temporary information?
This question matters a lot more than people think, especially if you are building your own workflow or integrating AI into your applications. Misunderstanding this difference can lead to completely wrong assumptions about what the system is doing.
So here’s what we’re going to do. By the end of this article, you’ll clearly understand the difference between prompting, using external data with retrieval, and actually retraining a model. You’ll know what it really means to “change” a model or just augment it, and when to do each. And along the way, we’ll have to cover embeddings and latent space, but don’t worry, I won’t turn this into a math lecture. The actual math isn’t so important.
Alright, imagine you run a small company that sells climbing gear, just so we have one consistent example and because I love bouldering.
Let’s also say you have a return policy, a product catalogue with shoes, chalk, a harness, and more, and some internal guidelines on how you want your employees to answer customers. Now you open ChatGPT, paste your return policy into the prompt, and say: “Answer my customers’ questions using this policy.”
So, did you just train the model?
No. Not even a little bit.
And what if you create a custom GPT or a project so you can reuse it easily in the future? Did that train your own version of the model?
Still no.
What you did was give the model a temporary context. The model’s internal parameters, the actual numbers that define how it behaves, its “brain”, did not change. You didn’t update its memory permanently. You didn’t alter its neural network. You simply provided extra text for this one interaction and the ones that follow in the same conversation or the same GPT.
Once the conversation ends, that information is gone. The next user doesn’t benefit from it. The model didn’t absorb it into its core knowledge. It also didn’t learn anything from the mistakes it made after you told it to answer differently. It just used this policy in its context while generating the answer.
That’s prompting.
Now let’s make it slightly more sophisticated. Instead of pasting your return policy every time, you build a small system. You store all your company documents in a database. When a customer asks, “How long do I have to return my climbing shoes?”, your system searches your documents, finds the return policy, and sends that relevant section along with the question to the model.
This is what people call RAG, or retrieval-augmented generation.
We store information externally, and the system retrieves it on demand. It’s super useful to easily customize answers to our own data.
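That flow is easier to see in code. Here is a minimal sketch in plain Python, where the document store and the keyword-overlap “search” are toy stand-ins for a real vector database and embedding-based retrieval (which we’ll get to next); the documents themselves are made up for the climbing-gear example.

```python
import re

# Toy document store standing in for a real database of company documents.
DOCS = {
    "return_policy": "Unused climbing shoes can be returned within 30 days.",
    "chalk_guide": "Store chalk in a dry place, away from moisture.",
}

def words(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str) -> str:
    """Naive retrieval: pick the document sharing the most words with the question."""
    q = words(question)
    return max(DOCS.values(), key=lambda doc: len(q & words(doc)))

def build_prompt(question: str) -> str:
    """RAG in one line: retrieved context + question, sent to an unchanged model."""
    return f"Use this policy to answer:\n{retrieve(question)}\n\nQuestion: {question}"

print(build_prompt("How long do I have to return my climbing shoes?"))
```

The model that eventually receives this prompt is untouched; all the “knowledge” lives in the string we assemble before calling it.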
But again, did we train the model?
Still no.
Again, we didn’t touch its internal parameters. We didn’t retrain anything. We built it a structured external memory system. The model can now read from your documents before answering, but its internal “brain” is exactly the same as before.
This is where embeddings come in, because you might wonder: how does the system know which document is relevant?
An embedding is simply a list of numbers that represents a piece of text. You can think of it as coordinates in a very high-dimensional space. Not just x and y, but thousands of dimensions. When we pass your return policy through an embedding model, it converts that text into a vector in this space. If we pass the question “How long do I have to return my climbing shoes?” through the same embedding model, we get another vector.
If the meaning of two texts is similar, their vectors will be close together in that space. If they’re unrelated, like “How do I maintain my car engine?”, the vectors will be far apart and the model will know it shouldn’t use this information.
Embeddings are how these systems capture the meaning of sentences and words. The system just takes these vectors and compares them: if one is close to another in that space, it’s relevant or similar. That’s it. It doesn’t do anything more sophisticated than these simple vector comparisons when matching a question against external data.
So in a RAG system, we convert all your documents into embeddings and store them. When a new question comes in, we convert it into an embedding, too. Then we simply look for the stored vectors that are closest to the question vector. Close means semantically similar. So nearby in that space.
It’s pure geometry.
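To make the geometry concrete, here is a tiny sketch using hand-written 3-D vectors as stand-ins for real embeddings, which typically have hundreds or thousands of dimensions. The sentences and the numbers are made up; only the comparison, cosine similarity, is the real mechanism.

```python
import math

# Toy 3-D "embeddings" for two stored documents (real ones are far larger).
embeddings = {
    "Unused shoes can be returned within 30 days.": [0.9, 0.1, 0.0],
    "How do I maintain my car engine?":             [0.0, 0.2, 0.9],
}
question_vec = [0.8, 0.2, 0.1]  # pretend embedding of the return question

def cosine(a, b):
    """Cosine similarity: close to 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Retrieval = pick the stored vector closest to the question vector.
best = max(embeddings, key=lambda text: cosine(embeddings[text], question_vec))
print(best)  # the return-policy sentence wins
```

Nothing here touches a language model at all; finding “relevant” text is literally just arithmetic on coordinates.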
So, when you add a new document to your database and compute its embedding, you are not modifying the language model’s internal knowledge. You are not injecting anything into its latent space (the model’s internal vector space, which we’ll define properly in a moment). You are adding a new book to the shelf and indexing it so you can find it later.
The model reads from the shelf when needed, and the index tells it exactly where to look, as if every sentence were shelved alphabetically, but the shelf stays external to the model.
So what would it mean to actually change the model?
A large language model is a massive neural network with millions or billions of parameters. Parameters are just numbers, like those in our vectors. On their own, they don’t mean anything, but together, they create a whole that has meaning and importance. In the model, they are the weights that determine how inputs are transformed layer by layer inside the network. During training, these weights are adjusted slightly over and over again so that the model gets better at predicting the next word.
Training literally means changing those numbers to maximize our chances of predicting the right next word.
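A one-parameter caricature makes this literal. Real LLM training adjusts billions of weights via backpropagation over huge batches of text; this sketch shrinks that to a single weight, one invented training example, and a hand-derived gradient, just to show that “training” means nothing more than nudging numbers to reduce prediction error.

```python
# A one-parameter "model": prediction = w * x.
# Training nudges w, over and over, to shrink the squared error on one example.
w = 0.0                # the model's single parameter (its entire "brain")
x, target = 2.0, 6.0   # made-up training example: input 2.0 should map to 6.0
lr = 0.05              # learning rate: how big each nudge is

for step in range(100):
    pred = w * x
    grad = 2 * (pred - target) * x  # derivative of (w*x - target)**2 w.r.t. w
    w -= lr * grad                  # the actual "training": changing the number

print(round(w, 3))  # w has converged to 3.0, so the model now maps 2.0 -> 6.0
```

Notice that nothing resembling the fact “2 maps to 6” is stored anywhere; it exists only implicitly, in the value the weight ended up with.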
If you fine-tune a model on your company’s climbing gear data, now you are actually modifying the brain. You are nudging many of those weights so that the model’s overall behavior shifts. It might start answering in your brand voice more consistently. It might internalize patterns from your documentation.
But even then, you are not inserting a clean fact into a specific location. The knowledge becomes distributed across many, many parameters. There isn’t a single neuron labeled “climbing shoe return policy.” The information is smeared across the network in a highly interconnected way. This is also why training and retraining models is quite a bit more complex than adding some form of external memory, and why a fine-tuned model may still hallucinate: the facts you are trying to teach it aren’t clearly saved anywhere; they merely influence the model’s future generations.
This is where the term “latent space” I mentioned often shows up and confuses people.
Latent space sounds mystical, but it’s not. When tokens enter the model, they are converted into internal vectors, much like the embedding vectors we discussed for external memory, except these live inside the model. They are transformed again and again across many layers, and all those intermediate representations live in what we call the latent space. It’s simply the internal representational space of the network: its “brain”, the way it understands the world, plotting a vector for each concept and word.
But again, it’s quite important to understand that this latent space is not a database stored somewhere inside the model. It’s not a map you can open and edit. It is the structure that emerges from how the parameters transform inputs.
This is why researchers can say things like “we don’t fully understand how it’s capable of answering this or that question.” The space is so large, and trained on trillions of words, that we can’t fully grasp which parts of it are being leveraged for a given answer, much like our own brain: too much interrelated information processed at once. Still, we fully understand what the model is mechanically doing, so don’t believe people claiming AI models train themselves or are getting out of hand; they simply add and multiply enormous numbers of vectors and matrices, all of which we fully control, to produce final answers. It’s just hard to visualize what influences what, given how large the space of numbers is. And that’s by design: the more numbers we use to represent meaning, the more complex the patterns the model can capture, and the more often it generates the right words.
I keep talking about latent space and parameters, so you might ask: are the parameters the latent space? The precise answer is that the parameters define the transformations, and those transformations create the geometry of the latent space. The latent space is the behavior induced by the parameters. It’s not a separate object you can directly manipulate.
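A tiny numerical sketch may help here. The 2×2 weight matrix below is a made-up stand-in for a model’s parameters; the “latent vectors” are simply whatever comes out when you push an input through the parameter-defined transform. There is no table of latent points stored anywhere to edit.

```python
# The "parameters": a fixed 2x2 weight matrix, frozen after training.
W = [[0.5, -1.0],
     [1.0,  0.5]]

def to_latent(x):
    """Apply the parameter-defined transform; the result is one point in 'latent space'."""
    return [W[0][0] * x[0] + W[0][1] * x[1],
            W[1][0] * x[0] + W[1][1] * x[1]]

# Different inputs land at different latent points, but those points are
# computed on the fly, not stored. The only way to move them is to change W.
print(to_latent([1.0, 0.0]))  # -> [0.5, 1.0]
print(to_latent([0.0, 1.0]))  # -> [-1.0, 0.5]
```

This is the whole point: the latent space is induced by `W`, so “editing the latent space” can only ever mean editing the parameters, which is training.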
And this explains why you cannot just inject data into the latent space. There is no control panel where you say, “Insert fact: our return window is 45 days.” If you want that to be permanently reflected in the model’s behavior, you need to retrain or fine-tune it, which means adjusting the weights through optimization. And even then, you have no guarantee it will properly retain the fact, and you could break other parts of its knowledge along the way, since you are reshaping this space by modifying the model’s parameters.
If you want certainty that it will know a specific fact, you either give it temporary context through prompting or give it external memory through retrieval.
One more quick clarification about recent reasoning models, because this ties into the same misunderstanding. When a model appears to reason step by step, what’s happening internally is still just vector transformations in latent space. There isn’t a separate symbolic planner sitting inside, the way we humans deliberate when thinking. There isn’t a neatly organized knowledge base with labelled facts. There are continuous numerical representations being transformed according to the learned parameters.
Humans also form internal representations, of course. But ours are grounded in perception and action. We see climbing shoes, we touch them, we use them, and we continuously update our understanding. Most large language models are trained, then frozen. They don’t update their parameters every time they read something new. It would be extremely complicated and costly to do so based on all the interactions we have with ChatGPT.
So when you paste your return policy into ChatGPT, you are not reshaping its internal world model. You are providing context. When you build a RAG system, you are building structured external memory. And when you fine-tune, you are actually changing the internal parameters, which reshapes the geometry of the latent space.
This distinction changes how we design systems. It prevents you from assuming that adding documents to a vector database is the same as teaching the model something permanently. And it helps you understand why “just inject it into the model” is not how these systems work.
The mental model I want you to keep is simple. The parameters are the brain. Training changes the brain. Embeddings are coordinates that let us organize and search for meaning. RAG is a bookshelf the brain can read from. Latent space is the internal geometry created by the brain’s transformations.
We retrain that brain to teach it a new language or make it an expert in a specific field, not to teach it individual facts like our return policy. Retrieval-based systems, meaning systems with an added external memory, are usually far less expensive, more powerful, and more controllable.
If you’d like, I can go deeper in a follow-up article on long-term memory systems and how we deal with that, because that’s where things start to blur and get really interesting. Let me know in the comments if that sounds interesting!
Thanks for reading through!