Multi-agent security research workstation
Self-hosted. Three months, ~800 commits. Produced an Intigriti 10.0 / 10.0 Exceptional report.
What it is
A self-hosted multi-agent system that runs the whole vulnerability-research loop: reconnaissance, hypothesis generation, evidence accumulation, adversarial chain-of-reasoning audit, and report drafting. Targets are live bug-bounty programmes.
Output across the first three months: an Intigriti 10.0 / 10.0 Exceptional rating against CM.com's admin API (broken object-level authorization, cross-tenant — redacted methodology below); authenticated SSRF via webhook header injection on Cambium Networks; OAuth/PKCE flow analysis on Venly; responsible-disclosure on Bild.de NewsBot (system-prompt reconstruction + multi-turn prompt injection bypassing the bot's four stated rules); AI/ML supply-chain disclosures on MLflow, Keras, and Vanna.
- Neo4j (knowledge graph)
- Postgres + pgvector
- Anthropic SDK / MCP
- multi-model routing — Claude / DeepSeek / local Qwen / Ollama
- browser pool with Chrome DevTools Protocol
- ~18 self-built plugin modules
Architecture, top-down
Four components hold the system together.
1. Shared knowledge graph as a common whiteboard
Facts, hypotheses, refutations, goals — all live in one Neo4j graph that every agent reads and writes. The schema went through repeated rebuilds. An early hypothesis-centric design encouraged agents to invent claims and back-fill evidence after; it was restructured so that facts come first and hypotheses must grow out of them. The graph currently holds ~380,000 stored facts.
Threat ontologies (CVE / CWE) are modelled as graph nodes for native agent retrieval. Hybrid retrieval: vector-semantic plus keyword. Subagent role splits (Plan / Explore) keep the context window of any single agent narrow.
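The hybrid retrieval step can be sketched as a score blend; this is a minimal sketch with assumed weights and in-memory facts (in practice the vectors live in pgvector and the facts in Neo4j, not Python lists):

```python
from dataclasses import dataclass
import math

@dataclass
class Fact:
    text: str
    embedding: list[float]  # stored alongside the graph node in practice

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, text: str) -> float:
    # Fraction of query tokens that appear in the fact text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_search(query: str, query_emb: list[float], facts: list[Fact],
                  alpha: float = 0.6, top_k: int = 5) -> list[Fact]:
    # Blend vector-semantic similarity with keyword overlap.
    scored = [
        (alpha * cosine(query_emb, f.embedding)
         + (1 - alpha) * keyword_score(query, f.text), f)
        for f in facts
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored[:top_k]]
```

The `alpha` weight is an assumption; the point is only that both signals feed one ranking.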
2. The agent roles
Per-target long-running sessions with role-scoped system prompts:
- scout — reconnaissance, fills the fact layer.
- builder — positive side, surfaces hypotheses out of facts.
- connector — positive side, walks goal-oriented chains backward from a target outcome.
- breaker — adversarial side, audits whole reasoning chains and the supporting evidence.
Plus three global skill agents that can be spawned by any of the above:
observer (returns to the field with proof), operator (runs structured checklists), plan (steps a question through to an actionable plan without executing).
3. Evidence accumulation, not conclusion-first
Observations enter the graph as small atomic facts with a single label (confirmed / candidate / unverified). Confirmed = runtime-verified; candidate = source-code or static evidence; unverified = inference. A finding is not declared without confirmed evidence at both source and sink.
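A minimal sketch of the labeling scheme and the source-and-sink gate; the class names are illustrative (the workstation's actual schema lives in Neo4j, not Python objects):

```python
from dataclasses import dataclass
from enum import Enum

class Evidence(Enum):
    CONFIRMED = "confirmed"    # runtime-verified
    CANDIDATE = "candidate"    # source-code or static evidence
    UNVERIFIED = "unverified"  # inference only

@dataclass
class Observation:
    fact: str
    label: Evidence
    location: str  # "source" or "sink"

def finding_allowed(observations: list[Observation]) -> bool:
    # A finding is declared only with confirmed evidence at both ends.
    confirmed_at = {o.location for o in observations
                    if o.label is Evidence.CONFIRMED}
    return {"source", "sink"} <= confirmed_at
```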
4. The adversarial review loop
Discussed in detail below. This is the design decision that shaped everything else.
Five generations of "good enough"
The workstation has been rewritten almost continuously. Each generation redefined what "perfect" meant — the previous generation's perfect became the next generation's joke.
Generation 1 — AI as a crutch
Early targets, including Venly's OAuth/PKCE work. I would hand Claude what I didn't understand and let it propose hypotheses. The model was too generative — twenty "if X then Y" chains, all plausible, none evidenced. The "perfect" of this generation was "agent gives me more hypotheses." Wrong axis.
Generation 2 — line-review
Borrowed straight from how Claude Code itself reviews diffs: the reviewer doesn't see your summary of intent, only the diff in front of it. Better, but the reviewer was still grading the working context's own immediate output — not the chain of reasoning that produced it.
Generation 3 — the hypothesis engine (failed)
I tried to fight fabrication head-on by making Hypothesis a first-class node type. It failed. Once "hypothesis" had a slot, the agent fabricated more freely, not less. The lesson — which I now carry across every system I design — is: don't give the model a scaffold that smooths the path for the failure mode.
Generation 4 — if-then controller (failed)
External rules bounding agent action. Wrong again. The rules were either rigid (missed cases) or loose (no constraint). Static rules don't tame a dynamic system.
Generation 5 — evidence-first triple store (current)
Facts deposit first; hypotheses grow out of facts; an adversarial reviewer audits the reasoning chain. This is what's running now. I already know there will be a sixth.
The decision: the adversarial reviewer reads the thinking trace, not the conclusion
The intuitive design is to give a "red team" agent the positive agent's final claim ("this is a vulnerability") and let it judge correctness. I went the other way. The reviewer never sees the conclusion. Only the reasoning chain.
Why
LLMs anchor. Once the model is given a conclusion, its next-token distribution skews toward "go along with this conclusion" — whether the response surface is "support" or "rebut." The anchor is already shaping the response. That isn't a prompt-engineering bug to patch around; that's the transformer's probability behaviour under anchoring.
LLM sycophancy is anchoring. Don't give the model the anchor.
Input shape
What the reviewer actually receives is not "the positive agent says X is a vulnerability — audit it." It is closer to:
"This 200 OK is anomalous, because nearby endpoints return 403. Judge that single inferential step. Does the reasoning hold given the supporting facts referenced from the graph?"
The reviewer sees one reasoning hop at a time. It has no awareness of what larger story the hop belongs to. It can only return yes/no based on whether that single causal jump holds, against the facts the positive agent cited from the graph.
The flow, drawn out
GRAPH              POSITIVE AGENT            ADVERSARIAL REVIEWER
(facts)

evidence ────────► step 1: claim A
                        |
cited fact ─────────────┴────────────────► judge step 1
                                           [yes]
                   step 2: claim B
                        |
cited fact ─────────────┴────────────────► judge step 2
                                           [yes]
                   ...
                   step N: claim FINAL
                        |
cited fact ─────────────┴────────────────► judge step N
                                           [yes]

The chain is accepted only when every hop survives independent audit.
The reviewer never sees conclusion N until after it has signed off on
every step that produced it.
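The loop above can be sketched in a few lines; `Step` and the `judge` callable are stand-ins (the real reviewer is a separate model context, not a Python function), but the shape is the point: the judge receives one claim and its cited facts, never the conclusion:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    claim: str
    cited_facts: list[str]  # fact references pulled from the graph

def audit_chain(steps: list[Step],
                judge: Callable[[str, list[str]], bool]) -> bool:
    """Accept the chain only if every hop survives independent audit.

    The judge sees one inferential step at a time -- claim plus the
    facts cited for it -- and never the chain's final conclusion.
    """
    return all(judge(step.claim, step.cited_facts) for step in steps)
```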
Why this matters — the CM.com chain only formed because of this
The positive agent wanted to drop the 200 OK as too weak a signal. I pushed it to keep asking why. The reviewer then audited each premise independently:
- "Is 200 OK on this endpoint anomalous, given that nearby paths return 403 / 500?"
- "If the request is replayed with a flipped
organisationId, are server responses consistent in shape and status?" - "If a payload field looks like a customer credential, is its presence in an ordinary API response an authorization failure or expected behaviour?"
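The second premise check, replaying the request with a flipped organisation identifier, reduces to comparing status and payload shape. A rough sketch, assuming responses arrive as plain dicts with `status` and `body` keys:

```python
def same_shape(resp_a: dict, resp_b: dict) -> bool:
    """Judge whether two API responses are consistent in status and shape.

    Identical status codes and identical top-level payload keys suggest
    the backend served both requests the same way, i.e. it did not
    re-validate tenant ownership on the flipped identifier.
    """
    if resp_a["status"] != resp_b["status"]:
        return False
    return set(resp_a["body"].keys()) == set(resp_b["body"].keys())
```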
At each audit, the reviewer had no idea the chain would terminate in 69 customer gateway credentials across 9 organisations. It was only judging local causality. Every hop survived. The chain stood.
This is a root-cause pattern, not a one-off trick
The same shape appears in three other places I've watched closely:
- Anthropic Claude Code's line-review — the reviewer sees the diff and the diff alone, not your summary of why you wrote it. The same anti-anchoring principle: don't let the working context's framing reach the reviewer.
- Anthropic Claude Code's Compact — when a long conversation is auto-compacted, the deciding agent looks at the working context from the outside, while the working context is still alive. It can see what the working context can't: what's actually load-bearing.
- Nous Research's Hermes Agent memory mechanism — periodic self-prompts decouple long-running behaviour from the immediate working context, so the runtime doesn't drift toward "what's in the buffer right now."
Different surface failures. Same root pattern:
Let an outside context judge what the working context can't see — while the working context is still alive.
Once I had that abstracted, I started using it across the workstation. It now runs in four places:
- The adversarial reviewer over chains of reasoning (described above).
- End-of-subtask external verifiers — when an agent finishes a step, a separate context confirms the artifact actually exists in the graph and matches the spec.
- Spec-vs-runtime two-way diff — running code is taken as ground truth, and stale documentation is automatically flagged for archival.
- Periodic self-feedback — agents append flow / tool / sp / efficiency / stuck entries to dated jsonl files; a separate context reads and acts on them.
A working agent can't reliably detect its own gaps. That's a logically false ask of a single context.
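The self-feedback channel can be sketched as an append-and-read pair; the file layout and entry fields here are assumptions, only the five category names come from the workstation's own list:

```python
import datetime
import json
from pathlib import Path

CATEGORIES = {"flow", "tool", "sp", "efficiency", "stuck"}

def append_feedback(log_dir: Path, agent: str, category: str, note: str) -> Path:
    # One dated .jsonl file per day; a separate context reads these later.
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    path = log_dir / f"{datetime.date.today().isoformat()}.jsonl"
    entry = {"agent": agent, "category": category, "note": note}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return path

def read_feedback(path: Path) -> list[dict]:
    # The outside context consumes the entries the working agent wrote.
    return [json.loads(line)
            for line in path.read_text(encoding="utf-8").splitlines() if line]
```

The separation matters more than the format: the writer never re-reads its own buffer; a different context does.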
Case study: CM.com — Intigriti 10.0 / 10.0 Exceptional
Redacted methodology only. No endpoints, customer names, PoC payloads, or credentials.
Bug class
Broken object-level authorization (BOLA / IDOR-style) across an admin API surface. Cross-tenant read and write on configuration objects belonging to other organisations.
How the chain formed
Weak signal. A scout agent operating from a free-tier account observed a single admin-side path returning 200 OK. The default path of least resistance — drop the signal as noise. I pushed instead: ask why this single endpoint behaves differently from the family of nearby paths. Ask what object it returned. Ask what 403 vs 500 vs 200 means in this product's tenant model.
Mapping by behaviour, not by endpoint name. The next move was to capture browser/API traffic during ordinary product use, map endpoint families, and pay attention to identifier shapes — anything that looked like organisation, customer, route, handler, or gateway IDs. The crucial question was not "is this endpoint documented?" but "does the backend enforce tenant ownership, or does it merely accept object identifiers without re-validation?"
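The identifier-shape pass can be approximated with a pattern filter over captured request parameters; the name hints and value shapes below are illustrative, not the actual target's identifiers:

```python
import re

# Shapes that tend to mark tenant-scoped objects: UUIDs and numeric IDs
# under parameter names hinting at organisation / customer / route /
# handler / gateway ownership.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)
NAME_HINTS = ("org", "organisation", "customer", "route",
              "handler", "gateway", "tenant")

def looks_like_object_id(name: str, value: str) -> bool:
    name_hit = any(h in name.lower() for h in NAME_HINTS)
    shape_hit = bool(UUID_RE.match(value)) or value.isdigit()
    return name_hit and shape_hit

def flag_identifiers(params: dict[str, str]) -> list[str]:
    # Return parameter names worth replaying with a flipped value.
    return [k for k, v in params.items() if looks_like_object_id(k, v)]
```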
Read first, write second, narrow PoC. Cross-tenant read access was confirmed first, against routing-configuration objects and message-handler records. Only after read was confirmed did I move to state-changing methods (PUT / POST / DELETE), and only on the smallest reversible modification path I could verify. Destructive testing avoided. Returned payloads exposed customer-side credentials, which were validated as usable while keeping verification minimal and documented.
Impact
- Cross-tenant read / write across the admin API surface.
- Configuration-object exposure at scale (tens of thousands of records).
- Customer credentials present in ordinary API responses across multiple organisations, including high-impact customers.
The remediation recommendations I submitted
- Enforce object-level tenant authorization centrally across the admin API layer.
- Remove trust in user-supplied tenant / customer identifiers.
- Ensure every read and write path checks ownership server-side, not via opaque token validation alone.
- Strip secrets from ordinary API responses.
- Rotate all exposed credentials and add audit logging for cross-tenant access attempts.
- Add regression tests specifically for BOLA-style cross-tenant access across each admin endpoint family.
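The last recommendation can take the shape of a parametrised regression guard; `client`, the endpoint template, and the tenant fixtures here are hypothetical:

```python
def assert_cross_tenant_denied(client, endpoint_family: list[str],
                               own_org: str, other_org: str) -> None:
    """Regression guard: replaying each endpoint in a family with another
    tenant's organisation ID must be rejected, for reads and writes alike."""
    for path_template in endpoint_family:
        path = path_template.format(org=other_org)
        for method in ("GET", "PUT", "DELETE"):
            resp = client.request(method, path, as_org=own_org)
            assert resp.status in (401, 403, 404), (
                f"{method} {path} leaked across tenants: {resp.status}")
```

Run once per admin endpoint family, so a regression in any one family's ownership check fails loudly.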
What this finding taught me about my own workflow
Earlier OAuth / PKCE work taught me that AI-generated hypotheses are powerful and dangerous on the same axis: an agent can quickly produce many plausible "if X, then Y" chains, but a real exploit has no room for unsupported "ifs." That failure pushed me from a hypothesis-first architecture to an evidence-first one. The CM.com chain was the first case where the new architecture earned its keep — small observation units, hypotheses grown out of them, and an adversarial reviewer auditing the reasoning chain step by step before the chain was trusted.
What's still wrong
The system is not finished and the next generation is already visible:
- The reviewer is still a single context. It can be tricked by a positive agent that learns to "pre-justify" each step. The next iteration probably needs adversarial-reviewer ensembles trained against each other rather than a fixed prompt.
- Atomic-fact extraction (the "atom" pipeline) was rewritten three times in the last month and is between architectures right now — facts versus card-shaped knowledge units. The wrong primitive lets failure modes grow back.
- Cross-target memory transfer is weak. Findings on one programme don't yet seed hypotheses on another. The graph is a whiteboard but not yet a teacher.
"Perfect" is not a static endpoint. It's a moving standard. I redefine it about every two weeks as my understanding of the system changes.