What I Look For When Hiring AI Engineers
After interviewing more than 100 candidates, here’s what actually stands out in AI engineering interviews and take-home assignments
A question I get all the time is, “How do I prepare for AI engineering interviews?”
And recently it’s usually paired with, “I keep hearing people are getting 24-hour take-home assignments. What do I practice so I don’t panic when I get one?”
Here’s the practical answer that you might not want to hear…
p.s. You can watch the video version of this article here! Please consider giving it a try and supporting the channel by subscribing and leaving a quick comment. Watch my AI interview tips video here.
Most people prepare like interviews are a trivia game. They collect a giant list of questions. They memorize definitions. They do a few LeetCode problems. That stuff is useful, sure, especially for some old-school coding interviews, but it misses what take-home tests are actually testing.
In a take-home test, the output matters. But your thinking matters more. And I say that after honestly interviewing more than a hundred candidates for our AI engineering roles at Towards AI.
How you clarify a vague spec and seek more information before continuing, or, when you can’t, make good assumptions and cover multiple scenarios. How you choose a baseline. How you evaluate. How you document tradeoffs. How you show you can ship something real, even if it’s small. And honestly, sometimes how you show you could not ship, why, and what you would do next. That is what we are looking for in the test. Not the final working app Claude built for you.
And if you want to prepare for them, here’s a useful truth about a lot of take-home assignments. They often come from something the company actually needs but didn’t have time to look into. A comparison they have wanted to run for a while. A quick proof of concept to test whether something works in their setup. A small internal tool they might keep if it’s decent. That’s why the prompt is often a little vague. Real work is vague. That’s the point.
I recently got an email from a student in our Full Stack AI Engineer course who said they were applying for jobs, and their friends were getting these 24-hour take-homes. They wanted a weekend practice idea that feels realistic, especially with modern tools and best practices.
And I told them it’s hard to predict exactly what any company will ask, because you can’t. But you can practice the pattern.
So I gave them a project idea that is very close to something we actually care about in real client work, which I also used when testing a candidate recently.
Build a simple OCR pipeline for processing written documents.
Not a Colab notebook. Not a demo that only works on one perfect example. A small pipeline you can run end to end on a real scanned document, evaluate, and explain.
I didn’t give any more details, really, but here’s what I or another manager would look for:
The first step is choosing something like ten documents. Make it concrete. Pick one type, but allow variation inside it. For example, ten invoices from different vendors. Or ten resumes. Or ten scanned recipes. Or ten medical forms if you want to be spicy, but keep it ethical and avoid sensitive data. That gets checked too! You could document how you avoid sending private data to Gemini’s API, or cite privacy as a reason for using a local model, etc.
Let’s use invoices because they’re easy to understand, and it forces you to deal with messy layouts.
Your ten invoices could include things like one clean modern PDF, one low quality scanned page, one invoice with a big table of line items, one with a weird currency format, one with a handwritten note somewhere. You want that variation. It is realistic, and it forces you to build something robust.
Then you decide what fields to extract. Keep it tight. A good set could be the vendor name, invoice number, invoice date, total amount, tax amount, and currency. If you want one table element, include line items, but only as a stretch goal. In 24 hours, shipping the core fields well is already a win. Then work on additions, or simply draft a clear, well-documented list of to-dos in Git issues, assuming someone ELSE will take over. They need to be readable and easy to act on, NOT JUST BY YOU! Document properly and clearly. And here, Claude Code can do it super well for you!
Now you define your target output. This sounds obvious, but it’s where a lot of people drift away from interesting results. You want a structured schema that you can store and evaluate. Something like a document record with metadata, and a set of extracted fields with types. Something you can then measure.
So, for example, your extracted output for each invoice might look like:
VendorName as text
InvoiceNumber as text
InvoiceDate as a date string in one consistent format
TotalAmount as a number
TaxAmount as a number
Currency as a three letter code
That becomes your contract. Your system either fills these fields or it doesn’t. And when it doesn’t, or does it incorrectly, you can measure it.
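One lightweight way to pin down that contract, assuming Python, is a dataclass where every unfilled field is explicit. All names here are illustrative, not a prescribed API:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class InvoiceFields:
    """The extraction contract: each field is either filled or None."""
    vendor_name: Optional[str] = None
    invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None   # one consistent format, e.g. "2024-03-15"
    total_amount: Optional[float] = None
    tax_amount: Optional[float] = None
    currency: Optional[str] = None       # three-letter code, e.g. "EUR"

    def missing_fields(self) -> list[str]:
        # Fields the pipeline failed to fill become measurable, not hidden.
        return [name for name, value in asdict(self).items() if value is None]
```

Now a half-extracted invoice reports exactly which fields it missed, which feeds directly into the evaluation step later.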
Now you build the pipeline. In its simplest form, it’s OCR plus extraction. You build this however you like first. But try to have a running baseline. It really doesn’t have to be ideal; a quick search for the best open-source or API-based extraction systems is enough.
OCR takes a PDF or image and gives you text. Extraction takes that text and produces structured fields. You can do extraction with a language model, or rules, or a mix. The key is that it outputs structured data reliably, not a paragraph that looks correct until you try to parse it. Honestly, here you could just use Gemini’s API, since it processes images, and ask it with a good prompt to extract the fields you are looking for. That’s a good starting point.
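Whichever model you call, the fragile part is turning its free-text response into your schema. A defensive parsing sketch (field names are illustrative; models often wrap JSON in markdown fences or add chatter around it):

```python
import json
import re

EXPECTED_FIELDS = {"vendor_name", "invoice_number", "invoice_date",
                   "total_amount", "tax_amount", "currency"}

def parse_extraction(raw: str) -> dict:
    """Turn a raw model response into a validated dict, or fail loudly."""
    # Pull out the first {...} block, ignoring fences and surrounding prose.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = EXPECTED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Keep only the contracted fields; drop anything extra the model added.
    return {key: data[key] for key in EXPECTED_FIELDS}
```

Failing loudly here is deliberate: a parse error on invoice 7 is a data point for your write-up, not something to silently paper over.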
And then comes the part that makes this interview ready.
You measure, you compare approaches, and/or you document how you’d do it next, depending on the timeline.
At a minimum, compare two. For example, SOTA OCR model A versus vision-LLM model B as the OCR. Or an open source OCR pipeline versus a baseline like sending your image to ChatGPT and asking it to extract fields. I’m not telling you to rely on that in production. I’m telling you it’s a great baseline in an interview because it shows you understand measurement and tradeoffs.
Then you define evaluation. Keep it simple and quantitative. For each document, did you get the vendor name correct? Did you get the invoice number correct? And so on. Count how many of the N fields you got correct. And please, take the 10–15 minutes needed to build and verify the ground truth of your small test set. Here, that just means extracting the information yourself, or taking what ChatGPT gave you and reading through it to make sure it’s right.
If you have ten documents and six fields, that’s sixty field extractions. You can literally report, “Approach A got 44 out of 60 correct, Approach B got 51 out of 60 correct.” Then look at the failures, check what went wrong, and write a super short breakdown. Maybe invoice dates fail on two documents because of the format. Maybe totals fail because the OCR missed a decimal. That kind of analysis is exactly what interviewers want to see. It shows you cared enough to take the time to understand and investigate, rather than just prompting and waiting.
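That field-level scoring fits in a few lines. A minimal sketch, assuming predictions and ground truth are dicts keyed by document id (names are mine, not a required structure):

```python
def score(predictions: dict, ground_truth: dict) -> tuple[int, int, list]:
    """Field-level accuracy: count exact matches, keep failures for analysis."""
    correct, total, failures = 0, 0, []
    for doc_id, truth in ground_truth.items():
        pred = predictions.get(doc_id, {})
        for field, expected in truth.items():
            total += 1
            got = pred.get(field)
            if got == expected:
                correct += 1
            else:
                # Record every miss so the write-up can explain *why* it failed.
                failures.append((doc_id, field, expected, got))
    return correct, total, failures
```

The `failures` list is the interesting output: grouping it by field tells you in seconds whether dates, totals, or vendor names are your weak spot.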
Then if you have some time, make it even more real by saving outputs to a database. Again, keep it boring. A local SQLite database is perfectly fine. One table for documents, one table for extracted results, maybe one table for evaluation results. This signals that you think like an engineer. It also makes reruns and comparisons easy. Showing you can use Git and cloud services for all of this is a bonus we also like to see, and it makes the work easily transferable to others.
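Those three boring tables take one function with Python’s built-in sqlite3 module. Table and column names here are just one reasonable layout:

```python
import sqlite3

def init_db(path: str = "pipeline.db") -> sqlite3.Connection:
    """Create the three tables if they don't exist and return a connection."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            doc_id   TEXT PRIMARY KEY,
            filename TEXT NOT NULL,
            added_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS extractions (
            doc_id   TEXT REFERENCES documents(doc_id),
            approach TEXT NOT NULL,   -- e.g. "ocr+llm" vs "vision-llm"
            field    TEXT NOT NULL,
            value    TEXT
        );
        CREATE TABLE IF NOT EXISTS eval_results (
            approach TEXT NOT NULL,
            correct  INTEGER NOT NULL,
            total    INTEGER NOT NULL,
            run_at   TEXT DEFAULT CURRENT_TIMESTAMP
        );
    """)
    return conn
```

Tagging every row with an `approach` string is what makes the A-versus-B comparison a single SQL query instead of a pile of JSON files.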
And you write a clear README that explains assumptions, decisions, and what you would do next.
That README matters more than people think. Because it shows how you think. It shows whether you can communicate. It shows whether you noticed constraints. It shows what you would clarify if this were real work.
Also, if you couldn’t finish something, write it down in the README or in Git issues, as I mentioned. If you chose not to do line items because you prioritized core fields and evaluation, that is a good decision. The important thing is that you can defend it, not that you did everything in time.
For this example, just two scripts are enough.
One script to process documents and store outputs.
One script to run evals and report results.
That’s it. That’s already a very strong take-home.
Now, yes, you can use Claude Code or Cursor or whatever tool you like. I’m not anti AI tools. This is how a lot of modern engineering is done.
But use it like a professional.
Work iteratively, step by step. After each chunk, pause and make sure you understand it. Don’t click “Yes” when Claude asks to continue automatically on new steps. Read its suggestions, plans, and changes. Read the code. Ask it to comment the code clearly. Explain the approach back in your own words. Validate dependencies and versions so you’re not pulling something weird that breaks on install. And do not let it “go build a project” while you watch.
And before you send the take-home test, please test it at least once in a fresh environment. Others will run your code in a real work environment. They need to be able to follow your setup markdown file and run the code without you debugging it for them. The same goes for dependencies: keep them up to date. Here, don’t trust what Claude and other LLMs suggest by default! They might be using old versions. Ask them to verify on the web that they are using up-to-date libraries, and confirm for yourself.
Because in the interview, you will be asked why you did it that way and why you used LangChain 0.3 instead of version 1.
And if the answer is, “Because the agent wrote it,” that conversation ends pretty quickly.
This is also why I’m stubborn about project-based learning.
In our Full Stack AI Engineer course, the goal is not to memorize prompts or collect techniques. It’s to build something real. For example, your own AI tutor with solid RAG practices, real chatbot patterns, evaluation habits, and the engineering workflow that translates directly to interviews. You end up with something you can show, something you can explain, and something that feels like the work companies actually hire for.
If you’ve done take-home tests recently, I’d love to hear what you got.
What were you asked to build? What was the weirdest constraint? How long did you have? And did they care more about the final result, or how you got there?
Drop that in the comments. It helps everyone else and gives me ideas for follow-up articles, since the formats vary a lot.