note №.007 · 2026 · 05 · 126 min - or one slow espresso

The interesting boundary around the model.

On context engineering, tool design, observability, and the quiet realization that the model is almost never the hard part.

If you spend long enough building with LLMs, you notice something: the model almost never becomes the hard part. The hard part is everything around it. The tools you give it. The context you let it see. The instructions you can revise without redeploying. The approval gate before it actually does anything. The observability that lets you debug when it inevitably gets something wrong.

I started saying "the boundary around the model" maybe six months ago, and I keep finding new reasons to say it. It started as a useful phrase for a whiteboard. It's become a kind of test I quietly run on every AI product I see: how much of the work is actually in the boundary, and how much is theatre about which model they're using.

The model is the easy part.

I've worked on cybersecurity copilots that started with "let's get a model to summarize alerts" and ended, after eight months, with a system where the model accounts for about 8% of the engineering work. The other 92% is everything that lets the model be useful, safe, and explainable to the person reading its output.

That 92% is:

  • a careful retrieval layer that pulls only what's actually relevant;
  • tool definitions tight enough to be useful, loose enough to compose;
  • an approval queue that lets an analyst override before anything ships;
  • structured outputs that downstream systems can actually parse;
  • a feedback loop where wrong answers become tomorrow's evals;
  • and a fallback path for when all of the above still doesn't work.
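To make two of those items concrete, here is a minimal sketch of a tight tool contract sitting behind an approval queue. Everything in it is hypothetical and chosen for illustration: the `isolate_host` tool, the `IsolateHostInput` and `ApprovalQueue` names, the `host-` prefix rule, and the list of allowed reasons. It is not taken from any real system, just one way the idea could look.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical vocabulary for an illustrative "isolate_host" tool.
ALLOWED_REASONS = {"malware", "lateral_movement", "credential_theft"}


@dataclass(frozen=True)
class IsolateHostInput:
    """Input contract for the (assumed) isolate_host tool."""
    host_id: str
    reason: str

    def __post_init__(self) -> None:
        # Tight contract: refuse anything a downstream system could not act on.
        if not self.host_id.startswith("host-"):
            raise ValueError(f"host_id must start with 'host-', got {self.host_id!r}")
        if self.reason not in ALLOWED_REASONS:
            raise ValueError(f"unknown reason {self.reason!r}")


@dataclass
class ApprovalQueue:
    """Holds proposed actions until an analyst approves or rejects them."""
    pending: List[IsolateHostInput] = field(default_factory=list)
    executed: List[str] = field(default_factory=list)

    def propose(self, raw_args: dict) -> IsolateHostInput:
        # Raises TypeError on missing or unexpected fields and ValueError on
        # bad values, so malformed model output never reaches the queue.
        action = IsolateHostInput(**raw_args)
        self.pending.append(action)
        return action

    def approve(self, action: IsolateHostInput,
                execute: Callable[[IsolateHostInput], str]) -> None:
        # Nothing runs until a human calls approve().
        self.pending.remove(action)
        self.executed.append(execute(action))

    def reject(self, action: IsolateHostInput) -> None:
        self.pending.remove(action)
```

In this sketch the model only ever produces `raw_args`. Validation happens in `propose`, and the side effect happens in `approve`, which is the one place an analyst can still say no.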

None of these are model problems. All of them are product problems wearing engineering hats. You can swap out the model later[1] and most of this code will not care.

What "good" feels like.

The teams I've watched ship real AI products share a handful of habits. None of them are about the model itself.

They spend more time on tool design than on prompt engineering. They write down what wrong looks like before they ship anything. They treat eval data the way good infra teams treat schema migrations: reviewed, versioned, and not invented in the middle of an outage. They obsess over the approval flow because they know that's where their product's trust either survives or doesn't. And they argue about latency and cost the way the previous generation argued about p99s, because at scale, cost-per-query is a feature.
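One way to read "write down what wrong looks like" is as eval cases that live in version control next to the code. The sketch below assumes a deliberately simple substring grader. The `EvalCase` shape, the `EVAL_SET_VERSION` constant, and the sample case are all invented for illustration, and a real eval suite would likely need something richer than string matching.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical version tag, bumped through code review like a schema migration.
EVAL_SET_VERSION = "2"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    must_include: Tuple[str, ...]      # what a right answer has to mention
    must_not_include: Tuple[str, ...]  # what "wrong" looks like, written down first


# Illustrative case only; not drawn from any real alert data.
CASES = (
    EvalCase(
        case_id="alert-summary-001",
        prompt="Summarize the beaconing alert for host-42.",
        must_include=("host-42", "severity"),
        must_not_include=("safe to ignore",),
    ),
)


def grade(case: EvalCase, output: str) -> List[str]:
    """Return a list of failures; an empty list means the case passed."""
    text = output.lower()
    failures = []
    for needle in case.must_include:
        if needle.lower() not in text:
            failures.append(f"missing: {needle}")
    for needle in case.must_not_include:
        if needle.lower() in text:
            failures.append(f"forbidden: {needle}")
    return failures
```

The useful part is less the grader than the habit: a wrong answer found in production becomes a new `EvalCase` in a reviewed change, not an ad hoc check written during an outage.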

The shortest version of all of this: the model is the part that looks like the product. The boundary is the part that is the product. The boundary is also where most of the actual engineering judgement lives: what to retrieve, what to let the agent touch, when to pause for a human, when to fail loudly.

A small principle.

When someone tells me their AI product feature isn't working, whether they're a friend at a startup, a candidate in an interview, or a teammate pinging me on Slack, I usually ask three questions, in this order:

  1. What context does the model have when it makes this decision?
  2. What tools does it have, and how strict are the input contracts?
  3. What happens when it gets it wrong: who notices, and who can intervene?
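Those three questions are much easier to answer if each model decision leaves a record behind. As a rough sketch, assuming a hypothetical `DecisionRecord` that a system writes per decision (the field names and the `triage` helper are made up for this example), the questions map onto three checks:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DecisionRecord:
    """Hypothetical per-decision trace; field names are illustrative."""
    decision_id: str
    context_ids: List[str]      # Q1: which retrieved items the model saw
    tools_offered: List[str]    # Q2: which tools it was allowed to call
    tool_call: Optional[dict]   # Q2: the call it actually produced, if any
    reviewer: Optional[str]     # Q3: who can intervene, if anyone


def triage(record: DecisionRecord) -> List[str]:
    """Walk the three questions in order and report what is missing."""
    findings = []
    # 1. What context did the model have?
    if not record.context_ids:
        findings.append("no context recorded")
    # 2. What tools did it have, and did the call stay inside them?
    call = record.tool_call
    if call is not None and call.get("name") not in record.tools_offered:
        findings.append("called a tool that was not offered")
    # 3. Who notices, and who can intervene?
    if record.reviewer is None:
        findings.append("no one can intervene")
    return findings
```

A record that comes back with an empty list from `triage` does not prove the feature works, but a non-empty one usually points at the boundary that was never drawn.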

The model almost never comes up. By the time we've worked through those three, we've found the bug, or the missing eval, or the boundary that was never actually drawn. The model was fine. The model is almost always fine.

I think a lot about what the next two years of this look like. Models will keep getting better, faster, and cheaper[2]. The boundary problem will not go away. The teams that figure out their retrieval, their tools, their approvals, and their evals will keep shipping. The teams that don't will keep impressing themselves with demos.

If you're building one of these systems, start with the boundary. The model will catch up.

- end of note -
  1. Within reason. I've watched a model swap break a system once, not because the new model was worse, but because the old system's prompts were quietly load-bearing on a specific failure mode. Add it to the list of things that aren't really the model's fault.
  2. I am also wrong about this approximately 30% of the time. The other 70% feels worth planning for.
filed under → ai, agents, security, building, opinions