How to Control Bias in AI Agents

The guardrails, audits, and human review loops that actually work



As AI agents become more autonomous, won’t they just amplify their biases and make everything worse?

I’ve gotten that question a lot recently.

That sounds reasonable. If a model already has biases, and now we give it more power, memory, tools, long-term planning, and the ability to act, doesn’t that just scale the problem?

And I think many people watching this are wondering the same thing, even if they don’t say it like that.

So in this video, I want to do three very clear things.

First, explain what bias actually means in the context of LLMs, and why a bias isn’t automatically bad.

Second, explain what fundamentally changes when we move from a simple language model to an autonomous agent.

And third, show how we can realistically control bias as autonomy scales, not just at the model level, but at the system level.

It’s easier to walk through this with one example, and since I’ve spent way too many hours recently hiring AI engineers and marketers, let’s use this one: imagine you have a company that builds an AI agent to screen resumes, shortlist candidates, schedule interviews, and even suggest final rankings to the hiring manager. Not just a chatbot answering questions. A system that takes actions.

Let’s start at the beginning.

When people say “LLMs are biased,” what does that actually mean?

Bias in a model simply means it represents patterns in its training data. That’s it. A model trained on internet-scale text will reflect statistical regularities in that data. If certain professions are more often associated with certain genders in the data, the model will learn that correlation. Not because it wants to discriminate. Not because it has intent. But because that’s what’s statistically present.

Bias is not automatically bad. In fact, without bias in the statistical sense, there would be no learning at all. Learning is detecting patterns. The real issue is not that the model has biases. The issue is what is in the data, which patterns are reinforced, and which ones we allow the system to act on.
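To make “bias is just statistical pattern” concrete, here’s a minimal sketch with a made-up toy corpus. A real model is vastly more complex, but at its core, the “bias” it learns is conditional frequency: how often things co-occur in the data it saw.

```python
from collections import Counter

# Hypothetical toy corpus: (profession, pronoun) pairs as they appear in text.
# The model has no intent; it just absorbs these frequencies.
corpus = [
    ("engineer", "he"), ("engineer", "he"), ("engineer", "she"),
    ("nurse", "she"), ("nurse", "she"), ("nurse", "he"),
]

def conditional_probs(pairs):
    """Estimate P(pronoun | profession) from raw co-occurrence counts."""
    counts = Counter(pairs)
    totals = Counter(prof for prof, _ in pairs)
    return {
        (prof, pron): counts[(prof, pron)] / totals[prof]
        for (prof, pron) in counts
    }

probs = conditional_probs(corpus)
print(probs[("engineer", "he")])  # 2/3: the skew in the data becomes the model's skew
```

The skew is not a choice the code makes; it falls directly out of the counts. That is exactly what “the model represents patterns in its training data” means.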

So with our hiring example, if historical hiring data reflects past inequalities, the model may learn those patterns. That’s not a moral choice by the model. It’s a representation of the data.

So here, in our case, we’d probably like to retrain a model to better align with our hiring needs and company guidelines.

But what changes during fine-tuning and alignment exactly?

After a lab like OpenAI creates its base model, in what we call the pretraining phase, companies apply techniques like reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), reward modeling, preference optimization, and more recently reinforcement learning with verifiable rewards (RLVR). Basically, we teach the model to act as we want.

Here, in simple terms, humans or AI systems rank outputs and the model is optimized to produce answers that align with our preferred behaviours: helpful, safe, fair, less toxic, more neutral.

This does reduce certain harmful outputs. It can make the hiring assistant more cautious about sensitive attributes. It can teach it to avoid explicitly discriminatory language.

But here’s the key point.

Retraining our model like this reshapes behaviour. It does not erase the statistical structure learned during pretraining. The underlying representation of the world is still based on the data distribution, which mostly comes from all the internet’s available data.

When retraining, we are steering outputs, not rebuilding the entire internal model of reality.

Now let’s introduce the real shift that we actually wanted to build here for our hiring purposes: agents.

A plain LLM generates text. You give it a prompt, it gives you a response. If the response is biased, it’s a biased sentence.

An agent is different.

An agent has a goal. It can plan over multiple steps. It can call tools. It can store memory. It can filter information. It can take actions based on intermediate results. All autonomously.

So in our hiring example, instead of just answering “what makes a good candidate,” the agent might:

Read a batch of resumes.
Rank them.
Request more data from an internal HR system.
Schedule interviews.
Update a shortlist over time.
Adjust its criteria based on performance metrics.
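The steps above form a loop, and that loop is the important part. Here’s a self-contained sketch of one screening round; every name and the scoring rule are invented for illustration, and a real agent would call external tools instead of local functions:

```python
def screening_round(resumes, criteria, shortlist):
    """One round of an autonomous screening loop (illustrative only)."""
    # Rank: score each resume against weighted criteria.
    scores = {
        r["id"]: sum(criteria.get(skill, 0) for skill in r["skills"])
        for r in resumes
    }
    # Update the shortlist: keep the top 2 this round.
    top = sorted(scores, key=scores.get, reverse=True)[:2]
    shortlist.extend(rid for rid in top if rid not in shortlist)
    # Adjust criteria: naively upweight skills seen in shortlisted resumes.
    # This is the step that feeds the agent's decisions back into itself.
    for r in resumes:
        if r["id"] in top:
            for skill in r["skills"]:
                criteria[skill] = criteria.get(skill, 0) + 1
    return scores

resumes = [
    {"id": "a", "skills": ["python", "ml"]},
    {"id": "b", "skills": ["python"]},
    {"id": "c", "skills": ["sales"]},
]
criteria = {"python": 2, "ml": 1}
shortlist = []
screening_round(resumes, criteria, shortlist)
```

Notice the last step: the criteria after this round depend on who was shortlisted this round. That is where a decision loop differs from a single generated paragraph.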

Now we’re not talking about a biased paragraph that we will edit anyway. We’re talking about a decision loop impacting people’s lives.

And this is where autonomy changes the impact of bias.

If there is a small skew in how the agent evaluates certain backgrounds, and it repeatedly filters candidates based on that skew, the system can amplify the pattern over time. Especially if it logs its own past decisions and uses them as feedback.

Planning, memory, and tool use create feedback loops. And feedback loops are where small effects can compound exponentially.
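A tiny simulation shows why feedback loops matter. The numbers here are invented: assume the agent advances group-A candidates at a slightly skewed rate, and each round it leans a bit further toward whatever it advanced before, because its own past decisions are now part of its evidence.

```python
def run_rounds(initial_rate, feedback=0.1, rounds=6):
    """initial_rate: share of group-A candidates the agent advances.
    Each round, the agent slightly prefers what it advanced previously."""
    rate = initial_rate
    history = [rate]
    for _ in range(rounds):
        # The agent reuses its own past decisions as a training signal.
        rate = min(1.0, rate * (1 + feedback))
        history.append(rate)
    return history

history = run_rounds(0.55)           # a 55/45 skew, barely noticeable at first
print(round(history[-1], 2))         # ≈0.97: after six rounds, 97/3
```

A 10% per-round drift sounds negligible; compounded, it turns a near-even split into near-total exclusion. That is the “small effects compound exponentially” point in miniature.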

There’s also a new risk that comes with agents: self-reinforcement.

If the hiring agent is evaluated based on “time to hire” and “retention rate,” it might start optimizing aggressively for signals that correlate with those metrics in historical data. If historical data is biased, the optimization process may lock into those same patterns.

This is not because the model suddenly became evil. It’s because optimization plus autonomy plus imperfect objectives can amplify distributional skew. It would be like giving your employees a huge salary bonus based on the number of candidates they interview, regardless of whether those candidates are a good fit. I doubt you’d increase the rate of good hires that way!
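The proxy-metric problem can be shown in two lines. The candidates and fields below are hypothetical; the point is that the objective the system was given and the outcome we actually wanted can pick different winners:

```python
candidates = [
    {"id": "a", "fit": 0.9, "days_to_hire": 30},
    {"id": "b", "fit": 0.4, "days_to_hire": 5},
]

# Objective the system was given: minimize time to hire.
by_speed = min(candidates, key=lambda c: c["days_to_hire"])
# Outcome we actually wanted: the best-fitting candidate.
by_fit = max(candidates, key=lambda c: c["fit"])

print(by_speed["id"], by_fit["id"])  # b a: the proxy metric picks the wrong one
```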

So should we panic?

No. Because here’s the important flip side.

Agents are not just models. They are systems.

And systems can be constrained.

When people talk about bias mitigation, they often focus only on the model. Bigger model, better alignment, more RLHF, more constitutional training. All useful. But that’s only one layer.

With agents, you have multiple control points. Multiple ways to mitigate biases and limit them. You are not entirely dependent on one single generation of a paragraph, hoping it will be good. You can steer language models and build workflows around them.

You control what data the agent can access.
You control what tools it can call.
You control what metrics it optimizes.
You control when it must escalate to a human.
You control validation steps before actions are executed.
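Those control points can be enforced in code around the model, not inside it. Here’s a minimal sketch of two of them, a tool allow-list and a human-escalation gate; the action names and the gating policy are invented for illustration:

```python
ALLOWED_TOOLS = {"score_resume", "schedule_interview"}  # tool allow-list
REQUIRES_HUMAN = {"send_rejection"}                     # escalation points

def execute(action, payload, human_approver=None):
    """Gate every agent action before it touches the real world."""
    if action in REQUIRES_HUMAN:
        if human_approver is None or not human_approver(payload):
            return "escalated"   # blocked until a human signs off
    elif action not in ALLOWED_TOOLS:
        return "denied"          # the agent simply cannot call this tool
    return "executed"

print(execute("schedule_interview", {}))       # executed
print(execute("send_rejection", {"id": "a"}))  # escalated
print(execute("delete_candidate_db", {}))      # denied
```

The model can propose anything it likes; the system decides what actually runs. That separation is what “agents are systems, and systems can be constrained” means in practice.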

In our hiring agent example, you could:

Remove sensitive attributes entirely from the evaluation pipeline.
Force structured scoring rubrics with predefined criteria.
Insert fairness checks before final ranking.
Log every decision for audit.
Require human approval, with clear reasoning attached, before sending rejection emails.
Run bias evaluation benchmarks regularly on synthetic candidate sets.
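Two of those mitigations, stripping sensitive attributes before scoring and logging every decision for audit, fit in a few lines. Field names and the scoring rule below are hypothetical:

```python
SENSITIVE = {"gender", "age", "name", "photo_url"}  # never reach the scorer

audit_log = []

def redact(resume: dict) -> dict:
    """Drop sensitive attributes before the evaluation pipeline sees them."""
    return {k: v for k, v in resume.items() if k not in SENSITIVE}

def score_with_audit(resume: dict, rubric: dict) -> int:
    clean = redact(resume)
    score = sum(rubric.get(s, 0) for s in clean.get("skills", []))
    # Log exactly what the scorer saw and what it decided, for later audit.
    audit_log.append({"id": resume["id"], "inputs": clean, "score": score})
    return score

resume = {"id": "a", "name": "Jo", "gender": "f", "skills": ["python"]}
s = score_with_audit(resume, {"python": 3})
print(s)  # 3; the scorer never saw name or gender, and the decision is logged
```

Redaction alone doesn’t solve proxies (a zip code or a school name can still correlate with a protected attribute), which is why the audit log and the fairness checks on top of it matter.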

Now bias mitigation becomes a system design question, not just an abstract model training question.

And this is where newer alignment techniques I mentioned like RLAIF, RLVR, reinforcement fine-tuning, and constitutional approaches come in. They try to shape high-level behavior. For example, training the model to prefer responses that treat demographic groups symmetrically, or to justify its reasoning under fairness constraints.

That helps. But it’s still steering behavior.

If the environment and objectives are poorly designed, the agent can still optimize in unintended ways. So the lesson is not “alignment fixes everything.” The lesson is alignment is one layer in a larger stack.

As autonomy increases, evaluation must increase too.

For a static chatbot, occasional red-teaming might be enough. For an autonomous hiring agent, you need ongoing monitoring. You need scenario testing. You need to simulate edge cases. You need observability: logs of which resumes were filtered, why, and what intermediate reasoning was used. You need to be able to backtrack.
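One concrete form of ongoing monitoring is counterfactual testing on synthetic candidates: flip an attribute that should not matter and check that the score is unchanged. The scorer below is a trivial stand-in; a real check would call the deployed agent:

```python
def scorer(candidate):
    """Stand-in for the agent's scoring function (illustrative only)."""
    return len(candidate["skills"]) * 10

def counterfactual_check(candidate, attribute, alt_value):
    """True if flipping the attribute leaves the score unchanged."""
    flipped = {**candidate, attribute: alt_value}
    return scorer(candidate) == scorer(flipped)

synthetic = {"skills": ["python", "ml"], "gender": "f"}
print(counterfactual_check(synthetic, "gender", "m"))  # True: attribute ignored
```

Run checks like this regularly, on synthetic sets covering the edge cases you care about, and alert when any of them start failing. That is what observability means for a decision loop.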

The more independent the agent, the more you need explicit structure around it.

Here’s a simple principle I like: scale constraints as you scale autonomy.

If your system has low autonomy, a prompt and a safety fine-tune might be enough.

If your system is making real-world decisions over time, you need architectural guardrails at all levels. Not just a better prompt.

And just so we’re absolutely clear about biases: they are not a bug that appeared when we invented LLMs. They are a property of data, and of the world. We are all biased, and so is our society, which is both good and bad. The goal is to maximize the good biases and minimize the worst ones. And since models reflect data, they will simply reflect that.

The good thing with agents is that they act within systems. If we design those systems carefully, we can decide which patterns are acceptable, which must be corrected, and where human oversight remains mandatory. We no longer depend solely on how OpenAI or Google decided to train their models, even though we use them.

As agents become more autonomous, bias stops being just a model problem and becomes a governance and architecture problem.

And that’s actually good news.

Because architecture is something we can design.

In our hiring example, the goal is not to remove all bias. That’s impossible. The goal is to define acceptable criteria clearly, align the model to them, constrain the environment, monitor outcomes, and intervene when drift occurs.

So instead of asking, “Will autonomous agents amplify bias?” maybe the better question is:

Have we designed the system around them carefully enough?

Let me know in the comments what kind of agent you’re building, whether bias is something you’re actively thinking about in your architecture, and what you’re doing about it. I’m sure it could help others, and I’d love to know!

Thanks for reading through. I’ll see you in the next one.