We’ve spent the last few weeks building a bespoke internal platform, a professional services automation (PSA) tool: the system that runs a consultancy, from the first lead through to the final invoice. Reasonable complexity. Real engineering. Real money flowing through it. The first phase has landed the core of what the system needs to do, essentially the parts where complexity, constraints and design actually get tested. We’ve since worked through the usual hardening, security controls and deployment, and we’re prepping for UAT now. The work that proves the thing is worth building landed early, and we’re excited about what we’ve seen. Not in the product but in the approach.

The point of this piece isn’t the platform. It’s how the work got done. Some of the shapes we’ve used for a decade aren’t as effective anymore, and a few needed to be put down before the better answer showed up. That’s not always easy to adjust to.

I’m sharing this piece for two audiences sitting in the same room: the practitioners doing the typing and the leaders deciding which bets to take. Both keep showing up in the same conversations now.

What the approach produced

Some numbers first, because everything after this is about judgement, and judgement is easy to assert without evidence. A one to two person team. Twenty-six days. 404 commits and around 380 pull requests merged to main, every one through full CI. Twenty-one domains with strict boundaries, an MCP catalogue of nearly 400 tools covering all of them, and 2,606 tests running against a real Postgres database rather than mocks.

When Opus 4.8 landed, the cadence jumped again. 52 pull requests merged in a single two-day stretch, roughly one every forty minutes. Not by typing faster. By setting goals rather than tasks, letting agents plan, build, test and open the pull request, then running several workstreams in parallel while I steered. I reviewed at the pull request, not the keystroke, because the controls did the gatekeeping. More on those controls shortly.

The big one: challenge your own approach

The single biggest thing I took from this build wasn’t a pattern, a stack choice or a workflow. It was that I went a long way from how I’d usually have approached this, and here’s how it’s paid off.

Patterns ossify. Beliefs about the right way to build are useful right up until they’re the thing in your way. We’re all used to being challenged by our peers, but teams can also become echo chambers. The presence of a model that will argue back, fluently, in your domain makes it cheaper than it’s ever been to interrogate your own defaults. The work I’m proudest of in this phase is the bits where I told the model to push back and then actually let it. Let’s be honest, this is as much about how you guide AI as it is about expecting results out of the box.

That’s the meta-lesson, particularly with the common patterns we see with this generation of models. Don’t let your AI tooling agree with you. Build in challenge and disagreement. Read what comes back honestly. And be willing to revisit which engineering instincts are still earning their keep. A few of mine were not.

What that looked like in practice

Greenfield is where you most want a counter-argument and least likely to get one. A blank doc is too compliant. So is an over-eager model. So I made every meaningful early-architecture conversation an argument, and every reversal got written back into the spec, not as a tidy retrospective but as the live trace of how the decision actually got made. The team’s parallel requirements gathering then plugged into that broad design and surfaced the operational complexity that drove our end-to-end testing approach (more on that shortly).

Some examples, deliberately picked from where the answer surprised me.

The “modular monolith” call. Not my default. I’m not a default-monolith person. But for this platform (small team, around 30 people, single-tenant, internal), when I sat with the trade-offs honestly, the seams I really needed were between domains in code, not between functions in deployment. So that’s what got built, with a linter rule and CI grep enforcing no cross-domain JOINs and no reaching into another domain’s internals. Right call for this case. Not a doctrine.

SQL where SQL fits. I love DynamoDB. Most of our recent work has put it to good use. But the access patterns here (open-ended reporting, ad-hoc joins, cross-entity margin and utilisation queries) are the shape SQL was actually built for. So Postgres won this round. Not a reversal of position.

Good engineering is matching the flow to the tool, not the tool to the resume.

Defer the infrastructure. Local-first build. No AWS scaffolding until the first integration actually needed a public URL, at which point Lambda came in narrowly, for that one need. The reason isn’t laziness, it’s speed. While the concept is still evolving, every hour spent on infra you’re not yet sure you need is an hour you couldn’t spend testing whether the design is right.

No UI for now. This was the one that took the longest to sit with. Eventually we’ll build one, begrudgingly, for external users where chat isn’t going to be the channel. But for internal use, the interesting realisation was that there might be entire use cases where we never need to. That changes how you approach the build. And it makes one thing concrete: AI is now a channel, not a feature. Designing the system as if chat is a real interface, not a side door, changes which decisions matter.

A stylised dark building at night, its tall front door glowing with coral and indigo light beneath a chat-bubble sign labelled "chat", people streaming up the lit steps into it, while a small dim side door labelled "web/UI" sits unused in shadow.

These weren’t decisions made because an LLM told me to. They were made because the conversation was honest enough that I changed my mind. The model’s job was to keep me honest.

The pattern: make the wrong thing hard to do

Here’s what makes the speed above sustainable rather than reckless. Speed doesn’t come from cutting corners; it comes from guardrails an AI can’t drive through, so you can move recklessly fast inside them. Every control we built exists to let us say “yes, ship it” sooner, not to slow anything down.

A single glowing racing line, a coral-to-indigo light streak, sweeping at speed through a dark circuit and threading the gap between two solid, continuous teal guardrails it never touches, with faint failed trajectories veering off into the dark. A small label reads "move fast inside the rails".

So the rules are mechanical, not matters of discipline or review. A boundary checker parses the code on every pull request and fails the build on what we’ve decided is never allowed: one domain importing another, a cross-domain database JOIN, money computed in floating point instead of decimal. The known exceptions sit in a short allowlist, each with a tracking comment, and that list is only ever allowed to shrink.

The wrong thing isn’t discouraged; it doesn’t compile.

The same logic applies to the code itself. The AI that writes the code doesn’t get the last word on whether it’s right. We run the work as adversarial by default: agents fan out across a change from independent angles, then are prompted to refute each other, and a result survives only if the others can’t break it. Generating code and trusting it are not the same act, so the adversarial pass is built into how the work runs, not a separate review step someone has to remember to invoke.

The automated test suite is the floor under all of it. Every test runs against a real Postgres database rather than mocks, and the reconciliation tests, the ones checking that a set of invoice lines always sums to the invoice header, are set to fail the build if the rounding bug they were written for ever comes back. The net stays correct even while everything above it moves fast. But the most valuable testing on this build wasn’t automated at all.

The pattern: AI as a test pilot

Once enough of the system was alive to run a scenario end-to-end, we put Claude Cowork on it as a user, not a builder. Create some people, win an opportunity, allocate the team, log a week of timesheets, raise an invoice, run the reports.

The model holds state. That’s where the value is. It quietly cross-checks numbers between entities. The roster says £31,800 of planned revenue for the team, the forecast says £9,500, and those should match. It surfaces the bugs that no unit test will ever find because they live in the seams between modules.

In about three hours we found ten distinct issues. All reported and fixed mid-session. All re-verified before moving on. A sample:

Budget fields silently dropped on project create.
The won-opportunity → project carry-over producing junk codes like “Acme -”, no project manager, no dates.
Allocations with no matching rate-card entry silently returning £0 instead of erroring.
A full double-billing path where one approved timesheet ended up on two separately-approved invoices, because the data model didn’t link timesheet entries to invoice lines.
A revenue forecast that ignored a “contracted” tier, so a Won £1.5M deal only counted £31.8k of actual allocated work.

The point isn’t the count, it’s the shape. None of these are “this function returned the wrong value”. They’re all “two parts of the system disagreed about reality”. That’s exactly the class of defect that’s expensive when real users hit it and almost free to find with a model running a story. Point AI at a running internal system before any human gets near it. You’ll burn through a session-worth of bugs that would otherwise turn into early-adopter pain. It’s one of the highest-value uses of model time in an engineering org, and almost nobody is doing it.

The pattern: visualisations are diagnostic, not decorative

We kept asking Claude to render state rather than describe it. Stacked-bar P&L per project. Capacity heatmaps for the team. Invoice timelines.

Every one of them found a bug the markdown table had buried.

The P&L stacked bar came out half-empty, with most projects showing zero revenue. That exposed a real bug in the recogniser logic, which was crediting revenue only for fixed-price work and silently ignoring T&M. The capacity heatmap put one person at 207% utilisation in a single glance. The before-and-after invoice timeline made the double-billing path impossible to argue with.

The picture finds the second thing wrong. The words have probably already told you the first. The visual is doing diagnostic work, not decorative work; it pulls apart things that look fine in tabular form.

The pattern: respect your context window

Halfway through this build, the move that paid off was to stop re-explaining context every session and instead bottle it into a small custom skill. That skill pins the role, the current phase of work, where the docs live, the ticket conventions, and which subagents to spawn for the specialised bits. Open Claude, name the skill, and off it goes, already knowing the rules. Underneath it, three systems of record stay in lockstep: the rules in a project instructions file (CLAUDE.md), the work in our issue tracker (Linear), the living spec in our knowledge base (Obsidian). The skill knows where each one lives, so I never have to say it again.

A few things this unlocked, with numbers that should mean something to anyone who’s tried to do real work in a chat window:

One hour, vague idea to buildable spec. A single greenfield session took a vague “we’re thinking about a tool that does these things” into ten linked Obsidian spec docs, forty-six tickets across five milestones, and a naming convention adopted vault-wide. No “now write it up” step. The chat was the deliverable.
One weekend, blank repo to roadmap with shape. Across the same skill-orchestrated pattern, a weekend produced a multi-doc architecture spec, an eighty-requirement gap analysis, and fifty-five additional cleanly-shaped tickets across the right milestones. Total in flight today: roughly a hundred tickets, ten milestones, every one with an owner and a clear acceptance shape.
Subagents for the specialised bits. Discovery, analysis, ticket-shaping, doc consolidation, all spawned cleanly from the orchestrator without burning the main session’s context. The orchestrator stayed focused on the conversation. The specialists came back with concrete outputs.

The orchestrator skill is only half of it. The bigger surprise was how much value came from small, single-purpose skills, each one heavily scoped to do exactly one job well: a UX audit, a security audit, a check against a specific standard. Narrowing the brief is what makes them work. A skill that audits one thing against an explicit checklist beats a general “have a look at this” every time, and those tightly scoped audits have been one of the biggest unlocks of the whole build.

If you keep re-explaining your role, phase, conventions and tools every time you open a session, that’s a skill waiting to be written. Your context window is a resource. Spend it on the conversation, not the preamble. The productivity gap between an engineer who has packaged their context and one who hasn’t is widening fast. It’s a coachable, learnable practice, and worth making time for.

MCP as a first-class transport: the bold bet

This is the one I think other CTOs will find most interesting, and the one I’m watching most closely.

The platform exposes its capabilities over both REST and MCP, against the same business logic, in the same Lambda. The MCP tool catalogue isn’t a sidebar; it covers every domain. Read tools return enough context to avoid a second call. Write tools take idempotency keys. Permission errors explain why they failed so the model can route around them. Both transports enforce the same permission checks from a single source of truth, so a gate can’t exist in one and quietly go missing in the other.

An isometric systems diagram on a near-black canvas: a single glowing crystalline core cube labelled "BUSINESS LOGIC" and "PERMISSIONS", with two identical, equally sized glowing channels feeding into it, one labelled REST and approached by a human figure, the other labelled MCP and approached by small agent figures, both passing through the same permission lattice.

The bet: at this scale, the UI is whatever you ask Claude. Where chat is genuinely awkward (allocation grids, approval queues, pre-invoice review), Cowork artifacts fill the gap on demand without a web build. Where we eventually need a real UI for external users, we’ll build it on the same /v1/ API.

None of which is to pretend the web side is easy. It’s still much harder than the API. Standing up a new capability as a set of tools is quick; making the same capability feel right in a browser, the layout, the placements, the dozens of small UX judgements, lags a long way behind. The gap sounds obvious, but in practice it’s huge, and it’s a large part of why chat-first is so appealing: it sidesteps the most expensive surface we build.

Honest question: who else is doing this?

Most teams I see are still treating MCP as a clever bolt-on. A thin wrapper over a REST API, often built second, often as an experiment. We’ve built it the other way round. REST and MCP are siblings, equal-priority transports on the same business logic. The MCP layer has the same coverage as the API. It is, deliberately, the primary channel.

That sounds like a small engineering choice. It isn’t. It’s a positioning choice.

If your internal tools are MCP-native, your roadmap changes.

The cost of building a new internal capability drops because there’s no UI to design. The audience for any new feature expands because anyone in the company who can hold a conversation can use it the day it ships. The cost of automating a workflow drops because the channel is already wired up.

For years, “what’s the channel?” meant “web or mobile”. Now, for internal tools, the answer is often just “chat”.

What’s coming next

Phase 1 has landed the core. The system has been put under real complexity, real constraints, real design pressure, and the patterns above have held up. The hardening, security controls and deployment are behind us. Next is UAT, and we’ll know more once real users start putting it through its paces.

What we’re already confident about:

The MCP-primary architecture is the right shape for the way our people actually want to work.
The willingness to challenge our own defaults, to honestly examine “is this still the right call?” on every consequential decision, is the practice that most consistently surfaced the better answer.

What to try

Four cheap experiments, in order of cheapness.

Open a chat instead of a blank doc. Tell the model to push back. Let the conversation be the deliverable. Re-edit decisions in place. By the end you’ll have something buildable and an honest trace of how each call got made.

Point it at the running thing before a human does. A focused test-pilot session against a fresh internal tool will find more interesting bugs than a fortnight of unit testing, because it cross-checks the entities the unit tests are scoped not to touch.

Ask for the picture, not the table. When you’re testing something complex, render the state. The visual is doing diagnostic work, not decorative work.

Bottle your context. If you re-explain role, phase, tools and conventions every time you open a session, that’s a skill waiting to be written. The payoff shows up the next time you sit down.

None of this required new tooling. It required asking different questions of the tooling we already had, and being willing to be told the answer was no, do it differently.

That’s the shift worth talking about.

AI is a channel, not a feature