TaxPilot
UK Self Assessment AI agent · 2026 · ongoing
A vertical AI agent for UK personal tax — the kind of work I spent five years doing by hand at PwC and Blick Rothenberg. The hard part isn't language fluency; it's keeping a model from confidently citing a rate that changed in last year's Finance Act, or applying a rule from a manual page that's been replaced.
The architecture is the answer to one question I kept failing to answer cleanly with flat RAG: where does each kind of knowledge actually live, and what makes it rot?
Three layers, each owned by a different rot mechanism
A flat vector store treats every document the same. That's fine for chat, not for tax. A statute, a manual page, a Finance Act amendment, and a procedural how-to decay at completely different rates and require completely different update workflows. So I split them.
Layer 1 — Skill (procedural)
Decision trees extracted from ATT and CTA syllabi. Each procedure (computing trading income, allocating the personal allowance, applying the transferable allowance, running the Statutory Residence Test) becomes a deterministic walk: input → branches → output, with citations attached at each branch.
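A minimal sketch of what one branch looks like in that shape (the type names here are mine, not the project's):

// Illustrative node shape for a procedure walk; names are hypothetical.
type Citation = { manual: string; paragraph: string }; // an HMRC manual code plus paragraph reference

interface ProcedureNode {
  id: string;
  test: string;                    // predicate over the input, with placeholders, never values
  citations: Citation[];           // authority attached at the branch itself
  ifTrue: ProcedureNode | string;  // next branch, or a terminal output
  ifFalse: ProcedureNode | string;
}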
Why this isn't RAG: the procedure rarely changes. The shape of "compute non-savings income, then savings, then dividend" is the same in 2024 and 2026. What changes are the rates, thresholds, and bands; those don't belong inside the procedure.
So procedures contain placeholders, never values:
// Pseudocode shape
if (non_savings_income > [BAND: basic_rate_limit]) {
  apply([RATE: higher_rate]);
}
Placeholders resolve at runtime against a separate, dated rate table. When HMRC publishes new rates for the next tax year, you update one table; every procedure that references it is correct the next morning. This is the part I would not have designed without having lived through five April rate-change cycles by hand.
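A sketch of that resolution step, assuming a table keyed by tax year (the table shape is mine; the 2024-25 figures shown are the published ones):

// Illustrative dated rate table; 2024-25 values shown (basic rate limit £37,700).
const rateTable: Record<string, Record<string, number>> = {
  "2024-25": { basic_rate_limit: 37700, basic_rate: 0.20, higher_rate: 0.40 },
};

// Resolve a [BAND: ...] or [RATE: ...] placeholder at runtime.
function resolve(taxYear: string, key: string): number {
  const value = rateTable[taxYear]?.[key];
  if (value === undefined) throw new Error(`no value for ${key} in ${taxYear}`);
  return value;
}

// The pseudocode above, resolved for a concrete year:
const nonSavingsIncome = 50_000; // example input
if (nonSavingsIncome > resolve("2024-25", "basic_rate_limit")) {
  // tax the slice above the limit at resolve("2024-25", "higher_rate")
}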
Layer 2 — RAG (citable authority)
HMRC Manuals (the EIM, BIM, CG, SAIM, and SAM series) and legislation.gov.uk text for the Income Tax Act 2007, TCGA 1992, and ITTOIA 2005, chunked and embedded with chunk-level metadata (manual code, paragraph reference, last-update date). The agent does not generate a tax position without citing the manual paragraph or the section of the Act that supports it.
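Roughly what that metadata looks like per chunk (field names are mine):

// Illustrative chunk-level metadata; field names are assumptions, not HMRC's.
interface ChunkMeta {
  source: "HMRC_MANUAL" | "LEGISLATION";
  manualCode?: string;  // manual series plus paragraph reference, for manual chunks
  actSection?: string;  // e.g. a section of ITA 2007, for legislation chunks
  lastUpdated: string;  // ISO date of the last published revision
}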
The discipline here mirrors what advisors are taught: position without authority is an opinion, not advice. When the model produces an answer, the supporting citations are not optional metadata — they are the answer. If the citation can't be retrieved, the answer doesn't ship.
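In code terms that's a hard gate, not a ranking signal; a minimal sketch, reusing the ChunkMeta shape above:

// Sketch of the citation gate: no retrievable authority, no shipped answer.
interface Answer {
  position: string;
  citations: ChunkMeta[];
}

function ship(answer: Answer, retrievable: (c: ChunkMeta) => boolean): Answer {
  const ok = answer.citations.length > 0 && answer.citations.every(retrievable);
  if (!ok) throw new Error("position without authority: refusing to ship");
  return answer;
}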
Layer 3 — Updates (continuous)
Finance Act amendments, HMRC consultations, case law from FTT/UT/Court of Appeal, and professional journals (Tax Adviser, Taxation magazine). This layer feeds back into the other two: an amendment updates the rate table; a tribunal decision adds a footnote to a manual chunk; a consultation flags a procedure for review.
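One way to picture that feedback routing, with event kinds taken from the list above (the dispatcher itself is hypothetical):

// Hypothetical routing of Layer 3 events back into Layers 1 and 2.
type UpdateEvent =
  | { kind: "FINANCE_ACT_AMENDMENT"; taxYear: string; changes: Record<string, number> }
  | { kind: "TRIBUNAL_DECISION"; manualCode: string; note: string }
  | { kind: "CONSULTATION"; procedureId: string };

function route(event: UpdateEvent): void {
  switch (event.kind) {
    case "FINANCE_ACT_AMENDMENT":
      // merge the new figures into the dated rate table (Layer 1 placeholders pick them up)
      break;
    case "TRIBUNAL_DECISION":
      // attach a footnote to the affected manual chunk (Layer 2)
      break;
    case "CONSULTATION":
      // flag the procedure for human review (Layer 1)
      break;
  }
}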
Without this layer, layers 1 and 2 silently rot. The model keeps citing manual paragraphs that have been superseded; the rate table drifts from reality. Most "AI tax assistant" demos I've seen ship without anything in this slot, which is why they look great on launch and embarrassing six months later.
ATT/CTA past papers as a regression test set
Pleasant accident of having sat these exams: every year, the professional bodies publish past papers with model answers. They cover exactly the surface area a competent agent should handle — capital gains computations, the pensions annual allowance, dividend stacking, residence/domicile, partnership allocations.
So they become the regression test set. Every architectural change runs against the bank. A change that improves chat quality but degrades CGT computation accuracy on the 2023 ATT Paper 2 is a regression, not an improvement, and gets rolled back.
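The harness is simple in outline; a sketch, assuming each paper is stored as scenario/model-answer pairs (the names are mine):

// Sketch of the exam-paper regression gate; any failure blocks the change.
interface ExamCase {
  paper: string;        // e.g. "2023 ATT Paper 2, Q4"
  scenario: string;     // the fact pattern fed to the agent
  modelAnswer: number;  // the published figure, e.g. a CGT liability
}

async function regress(cases: ExamCase[], compute: (s: string) => Promise<number>): Promise<boolean> {
  const failures: string[] = [];
  for (const c of cases) {
    const got = await compute(c.scenario);
    if (Math.abs(got - c.modelAnswer) > 0.005) { // penny tolerance
      failures.push(`${c.paper}: got ${got}, expected ${c.modelAnswer}`);
    }
  }
  failures.forEach((f) => console.error(f));
  return failures.length === 0;
}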
This is the cleanest objective evaluation surface I've found for vertical-tax LLM work. Generic benchmarks (MMLU, etc.) tell you nothing about whether the model can correctly stack non-savings, savings, and dividend income through the bands.
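For concreteness, a sketch of the stacking that the test set exercises, ignoring allowances and the additional rate band for brevity:

// Sketch of statutory stacking: non-savings fills the bands first, then
// savings, then dividends on top. Allowances and additional rate omitted.
function stack(nonSavings: number, savings: number, dividends: number, basicRateLimit: number) {
  const slices: { type: string; inBasicBand: number; aboveBasicBand: number }[] = [];
  let used = 0;
  const order: [string, number][] = [["non-savings", nonSavings], ["savings", savings], ["dividend", dividends]];
  for (const [type, amount] of order) {
    const inBasicBand = Math.max(0, Math.min(amount, basicRateLimit - used));
    slices.push({ type, inBasicBand, aboveBasicBand: amount - inBasicBand });
    used += amount;
  }
  return slices; // each slice is then taxed at its income type's rate for its band
}

Called with the basic_rate_limit value from the earlier rate-table sketch, this yields per-slice band allocations that the rate placeholders then price.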
HMRC Developer Hub integration (planned)
The end-state isn't a chatbot. It's an agent that prepares a complete SA100 return against HMRC's Making Tax Digital APIs:
- OAuth 2.0 with HMRC Developer Hub (sandbox first)
- Fraud Prevention Headers (mandatory for MTD; specific device fingerprint format)
- Self Assessment Individual API endpoints for income / allowances / submission
- Local-first audit trail — every API call recorded, signed, and replayable
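A sketch of what a sandbox call might look like. The base URL and Accept media type follow HMRC's published conventions; the helper name and the exact Gov-Client-* header set shown are illustrative (the mandatory set depends on connection method and is defined in HMRC's fraud prevention specification):

// Illustrative sandbox request shape; not a complete fraud-prevention header set.
const SANDBOX = "https://test-api.service.hmrc.gov.uk";

async function hmrcGet(path: string, accessToken: string, deviceId: string): Promise<unknown> {
  const res = await fetch(`${SANDBOX}${path}`, {
    headers: {
      Authorization: `Bearer ${accessToken}`,               // OAuth 2.0 user-restricted token
      Accept: "application/vnd.hmrc.1.0+json",              // HMRC's versioned media type
      "Gov-Client-Connection-Method": "DESKTOP_APP_DIRECT", // fraud prevention headers
      "Gov-Client-Device-ID": deviceId,                     // stable per-install identifier
      "Gov-Client-Timezone": "UTC+00:00",
    },
  });
  if (!res.ok) throw new Error(`HMRC ${res.status}: ${await res.text()}`);
  return res.json(); // caller appends request + response to the signed audit trail
}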
The agent doesn't replace the advisor; it does the assembly. The advisor reviews the computation and authorises submission. That separation is non-negotiable for anything that goes to HMRC under a real UTR.
What's not done
This is the honest section. As of writing:
- The Skill layer covers Income Tax and CGT for individuals. Not yet partnerships, not yet trusts, not corporation tax.
- The Updates layer runs on hand-maintained cron jobs and curated feeds; ingestion isn't automated. Each Finance Act still needs human-in-the-loop diff review.
- HMRC Developer Hub integration is in sandbox only. Production submission requires an agent services account and MLR registration, neither of which I'm setting up until the rest is solid.
- The exam-paper regression set has 60-70% coverage. The remaining 30-40% (mostly partnership and complex residence cases) still requires manual annotation.
The project is held back from public release until the Updates layer is something I'd trust on a real client return. The version that exists is good enough to assist me; not yet good enough to ship to a tax advisor who doesn't already know its limits.
Why I'm building this
Five years inside UK tax. The work is regulated, messy, and the consequences of a wrong answer are real (HMRC penalties, disclosure obligations, professional indemnity claims). It's exactly the kind of vertical where generic LLMs are dangerous — confident, fluent, and wrong in ways non-experts can't catch.
Most of the AI tax tools I've seen are built by ML engineers without tax backgrounds, or tax people without engineering backgrounds. The interesting seam is sitting on both sides at once — knowing what HMRC will and won't accept, and also knowing why an embedding distance between two manual paragraphs is sometimes meaningless.
This is the kind of project where domain knowledge does the load-bearing work. The architecture above isn't elegant because I'm a clever architect; it's the architecture you arrive at when you've watched a Skill (procedure) survive five Finance Acts unchanged while every Rate (value) needed updating.