What Did Microsoft's MAI Model Actually Train On?

Model choice is not only about benchmark scores. It is also about what behavior the model inherited.

Right now, almost every AI lab is quietly training on other models’ output — copying each other’s work, basically. Microsoft did the opposite.

They refused to use synthetic data, then actively hunted down AI-generated content and removed it before training. And they wrote a 100-page report daring every other lab to prove they did the same.

p.s. to experience the full bundle, watch the youtube video version! ;) 
Watch here: https://youtu.be/Sl5O7KVVF6M

This month, Microsoft AI announced seven MAI models built in-house across reasoning, coding, image, transcription, and voice. But the important one for us is MAI-Thinking-1, their flagship reasoning model. It beats Claude Sonnet 4.6 on AIME 2025, a very prestigious U.S. high school math competition. However, what makes this release worth your attention is not the benchmark performance alone, but the report that’s more transparent than anything I’ve read from a major lab this year.

Here’s why that matters. Almost every other AI lab that has reached this level of reasoning performance got there with synthetic data or distillation, meaning they trained on another model’s output.

DeepSeek used 800,000 reasoning samples to do it. Llama 4 was distilled from a bigger Llama. Qwen3 literally named their method “Strong-to-Weak Distillation” in their own report. And then there are the disputed claims around OpenAI, Anthropic, and Google all copying from each other.

So everyone in AI is asking the same awkward question. How much of a new model’s intelligence was actually learned, and how much was just copied? Microsoft decided to make their answer crystal clear.

On the second page of their report, Microsoft wrote that “We choose to not use any synthetic data generated by language models during pre-training and make an effort to avoid and remove AI-generated content within collected data sources.”

And look, distillation exists for a reason. Once you have a strong model, generating a million more training examples from it is cheap and fast. Doing it the clean way, finding and verifying real human data, is expensive and slow. And Microsoft paid that bill instead.

I’m Louis-François, CTO and co-founder at Towards AI, where we turn engineers into AI engineers who build and ship AI products. Let’s get into it!

First, let’s talk specs. MAI-Thinking-1 is a mixture-of-experts model: 1 trillion total parameters, 35 billion active per token. So you get a trillion-parameter scale, but each token only pays for a thin active slice — which is what lets you serve it on GB200-class systems, instead of eating dense-1T compute on every forward pass.
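To make the "thin active slice" concrete, here is a back-of-the-envelope sketch. The parameter counts come from the paragraph above. The cost model is an assumption, not something from the report: it uses the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass, and ignores attention and routing overhead.

```python
# Parameter counts quoted in the text above.
TOTAL_PARAMS = 1_000_000_000_000   # 1 trillion total parameters
ACTIVE_PARAMS = 35_000_000_000     # 35 billion active per token


def active_fraction(active: int, total: int) -> float:
    """Share of the model's parameters that a single token actually uses."""
    return active / total


def forward_flops_per_token(params: int) -> int:
    """Assumed rule of thumb: ~2 FLOPs per parameter per token (forward only)."""
    return 2 * params


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"Active fraction per token: {frac:.1%}")
print(f"Dense-1T vs. MoE forward compute: {dense_flops / moe_flops:.1f}x")
```

Under that assumption, each token touches about 3.5% of the weights, and a dense model of the same total size would cost roughly 28.6 times more compute per forward pass.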

The model was pre-trained on 30 trillion tokens, then mid-trained on another 3.55 trillion. Mid-training is basically a second pre-training pass on a smaller, cleaner mix. For this model, it focused on STEM, math, and code, and it was also used to expand the context window from 16K to 64K and then to 256K tokens.

The headline benchmarks are real and we’ll get to exactly where it wins and loses later. The point for now is that they hit competitive frontier numbers while refusing the shortcut everyone else takes.

Now, this is easy to overstate, so let’s be precise. Microsoft isn’t claiming zero synthetic anything. Later in RL, they do self-distillation from their own checkpoints, use synthetic environments in places, and merge specialist teachers back into one model. The actual claim is narrower and more interesting: no LLM-generated synthetic data during pre-training, active removal of AI-generated content from collected sources, and no distillation from third-party models. The base capability is supposed to come from human-generated data and their own training stack, not Claude, GPT, Gemini, or a hidden teacher.

Microsoft frames the whole thing around three design principles. The first — capabilities should be learned, not inherited; a model that copied someone else’s thinking is harder to steer. Second — simple, clean recipes scale; shortcuts compound into hacks. Third and last — if you can’t prove a choice helps, you don’t make it. And if those sound like jabs at every other lab in 2026… that’s because they are.

A lot of labs say “we don’t use synthetic data.” But that phrase gets slippery fast. Microsoft actually drew four hard lines.

We’ve covered the first two. The other two are where it gets unusual. Third: no off-the-shelf open-source datasets; everything was processed in-house from raw sources they control. Fourth: no private customer data unless the user opted in.

They even excluded Hugging Face from their crawl. The biggest open AI hub on the planet — and they refused to train on it, because they couldn’t trust what was in it. Imagine being in 2015 and banned from using Wikipedia.

And while we are talking about it, the attention to detail goes deep. For Wikipedia, they trained on the raw markup instead of the clean HTML, even though it’s three times more verbose. Why? Because the stripped version loses the infoboxes, those structured panels with population, dates, all the dense factual stuff. They paid 3x the token cost just to keep that signal.

But here’s the part I think AI engineers should actually steal. They didn’t pick their data mix on vibes. They trained 183 models from scratch across 61 different mixtures to find out what worked. And the lesson was a painful one: small models lie. A STEM-heavy mix looked best at small scale, but a code-heavy mix won once the model got bigger. So if your data experiment only works at toy scale, you might be tuning for your experiment, not your actual model.
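One way to guard against that trap in your own work is to check whether the ranking of your mixtures is stable across model sizes before you trust it. Below is a minimal sketch, assuming you have one evaluation score per mixture at each scale. The mixture names and scores are made up for illustration and are not from Microsoft's report; the helper names are hypothetical too.

```python
def rank_mixtures(scores: dict[str, float]) -> list[str]:
    """Return mixture names sorted best-first (higher score is better)."""
    return sorted(scores, key=scores.get, reverse=True)


def winner_flips(results_by_scale: dict[str, dict[str, float]]) -> bool:
    """True if the top-ranked mixture is not the same at every scale."""
    winners = {rank_mixtures(scores)[0] for scores in results_by_scale.values()}
    return len(winners) > 1


# Hypothetical eval scores per data mixture at two model sizes.
results = {
    "small": {"stem_heavy": 0.42, "code_heavy": 0.39, "balanced": 0.40},
    "large": {"stem_heavy": 0.61, "code_heavy": 0.66, "balanced": 0.60},
}

for scale, scores in results.items():
    print(scale, rank_mixtures(scores))

print("winner flips across scales:", winner_flips(results))
```

If the winner flips, the small-scale result is telling you about the small model, not about the one you plan to ship.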

The final mix is also interesting. It is mostly code — over half — heavy on STEM, and surprisingly light on math… except they repeat those math tokens five times over, more than anything else. The strategy wasn’t to train on the internet and pray. It was: measure which data actually predicts the skills you want, then prove it still holds when the model scales up.
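Repetition changes what the model actually sees, so a source that is small in unique tokens can still carry real weight in training. Here is a small sketch of that arithmetic. Only the five-times repeat for math comes from the text; the unique-token counts and the other repeat counts are hypothetical placeholders, not figures from the report.

```python
# Hypothetical unique-token counts (in trillions) and repeat counts per source.
# Only the 5x repeat for math is taken from the text above.
sources = {
    "code": {"unique_tokens_t": 10.0, "repeats": 1},
    "stem": {"unique_tokens_t": 5.0, "repeats": 1},
    "math": {"unique_tokens_t": 0.5, "repeats": 5},
}


def effective_shares(mix: dict[str, dict[str, float]]) -> dict[str, float]:
    """Share of tokens seen in training per source, after applying repeats."""
    seen = {name: s["unique_tokens_t"] * s["repeats"] for name, s in mix.items()}
    total = sum(seen.values())
    return {name: tokens / total for name, tokens in seen.items()}


shares = effective_shares(sources)
for name, share in shares.items():
    print(f"{name}: {share:.1%} of tokens seen")
```

With these placeholder numbers, math is about 3% of the unique data but about 14% of what the model trains on.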

Now if you refuse every shortcut, the bill comes due somewhere. And Microsoft is honest about it. Page 30, they admit it plainly: this is their first reasoning model, so it started from scratch with no chain-of-thought to copy — which made keeping training stable their central problem. With distilled traces, the model starts with something that already looks like a chain of thought, and RL sharpens it. Without that, early rollouts are worse, rewards are noisier, long traces burn more inference compute, and the climb has more ways to collapse before the benchmark number moves.

The cost of not starting from third-party distilled data is visible in three places. First: they had to add extra RL stability machinery, including adjustments to GRPO, self-distillation from their own checkpoints, and infrastructure work to remove numerical mismatch between training and inference. Second: they biased their mid-training mix heavily toward STEM and code, basically to substitute for the bootstrap that distilled traces would have given them. And third: the pre-training loss curve, on page 23, has visible spikes early in the run. They recovered from every one without intervention. That last part is not small. On 8,192 GB200 GPUs, every wall-clock hour is 8,192 GPU-hours before engineers, scheduling, or opportunity cost. Microsoft reports 90 percent goodput across the 30-trillion-token run, with only 6.5 hours of recomputation overhead. That’s the expensive version of “it didn’t crash badly enough to stop us.”
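To put those cluster numbers in perspective, here is the arithmetic as a quick sketch. It rests on two assumptions that the text does not confirm: that the 6.5 hours of recomputation are wall-clock hours on the full cluster, and that "goodput" means the fraction of cluster time that turns into useful training progress.

```python
NUM_GPUS = 8_192          # GB200 GPUs, from the text
GOODPUT = 0.90            # reported goodput, from the text
RECOMPUTE_HOURS = 6.5     # reported recomputation overhead, from the text


def gpu_hours(wall_clock_hours: float, num_gpus: int = NUM_GPUS) -> float:
    """GPU-hours consumed by the whole cluster over a wall-clock duration."""
    return wall_clock_hours * num_gpus


# Assumption: the 6.5 hours are wall-clock hours across all GPUs.
recompute_gpu_hours = gpu_hours(RECOMPUTE_HOURS)

# Assumption: (1 - goodput) of cluster time is not useful training progress.
lost_per_100h = gpu_hours(100) * (1 - GOODPUT)

print(f"Recomputation overhead: {recompute_gpu_hours:,.0f} GPU-hours")
print(f"Non-productive GPU-hours per 100 wall-clock hours: {lost_per_100h:,.0f}")
```

Under those assumptions, the recomputation alone is about 53,000 GPU-hours, and every 100 hours of wall-clock time loses roughly 82,000 GPU-hours to everything that is not forward progress.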

So after choosing every hard path, where does it actually land? Microsoft’s own words, page 53: “it doesn’t lead the field”. And I respect that they wrote that themselves. The win is real: on AIME 2025 it beats Sonnet 4.6, 97 to 95.6. The losses are just as real. It trails both Anthropic models on SWE-bench Verified, and gets doubled by GPT-5.4 on Terminal-Bench. But they never trained on terminal environments at all. So that score isn’t a failure; it’s a model doing fine at a task nobody taught it. Train the next version for it directly, and it should jump.

The human side-by-side says the same thing. Against Sonnet 4.6, MAI wins on conciseness and style, ties on factuality and instruction-following. Overall preference: plus 0.07, basically a tie. Against Opus 4.6 it loses by the same margin. So the bet didn’t make Microsoft the best model in the world. It made them a real competitor — and a lab you can trust. Whether that was worth the cost becomes the actual question.

The thing I want to land here is just this. Most of these labs are great labs doing great work. Distillation isn’t necessarily a crime; it’s a tradeoff. But Microsoft is one of the very few labs being this specific about exactly what they’re committing to. And in 2026, that kind of detail is the rare part.

So what do you take from this if you’re building? Two things.

One, if you’re picking an open model, ask what it’s downstream of. That changes which biases, refusals, formats, and weird teacher habits you inherit, which is what Microsoft’s first principle is actually about.

Two, if you’ve been worried about AI slop, the feedback loop where models train on previous models’ outputs until quality slowly degrades, MAI-Thinking-1 is one of the cleanest answers from a frontier lab so far. It didn’t beat everyone. But it didn’t eat the slop at the base-model stage. For teams that care about lineage and enterprise trust, that might matter more than one or two leaderboard points.

The question I keep coming back to is whether the cost was worth the principle. Microsoft says yes. The benchmark numbers say, “kind of.” Six months from now we’ll see whether Microsoft’s lineage scales further than the distilled ones, or whether the distilled labs have figured out how to climb on their own without it. 
What do you think? Was it worth it? Let me know in the comments. And as always, I’ll see you in the next one!