Project

K.E.V.I.N.

Knowledge Extraction & Vulnerability Injection Network — an autonomous AI red-teaming framework, and a running log of what breaks frontier models.

K.E.V.I.N. is a research loop with three roles. A target AI is given a secret and told never to share it. An attacker AI tries to extract it anyway, refining its tactics after every rejection. A panel of vendor-diverse judge AIs scores each attempt independently, and the median is the verdict. It runs unattended, remembers what didn't work across runs, and is grounded in the published jailbreak literature and the NIST AI Risk Management Framework plus MITRE ATLAS.

The point isn't to attack anyone's deployment. It's to make ten rounds of systematic, scored adversarial pressure cheap and repeatable — so the answer to "how does this hold up at the edges?" is something you've watched, not something you assume.

On responsible use

K.E.V.I.N. only points at models the operator controls — self-hosted, or a provider API you authenticate against with your own keys, evaluating personas you defined. Every target, company, and secret in these write-ups is synthetic.

It is not a tool for probing public chatbots or anyone else's systems without authorization, and there is no generic "point it at a URL" target by design.

Findings

All K.E.V.I.N. posts →

How it works

  • The loop is a state machine. Built on LangGraph: attacker → target → judge → route, looping until success, a turn cap, or a budget cap.
  • Scoring is honest by construction. A panel of three vendor-diverse judges scores each attempt and the median wins, implementing the StrongREJECT rubric — because a single LLM judge disagrees with itself often enough to be untrustworthy.
  • Ground truth is deterministic. Whether a secret actually leaked is decided by exact-match, canary tokens, and a format regex — not by a model's opinion.
  • Runs are reproducible. Every run writes a manifest (git SHA, model versions, hashed prompts, seed, budget) before the first model call, so a finding can be replayed exactly.

It's an Apache-2.0 framework covering thirteen attack vectors plus a deterministic output-guard defense. The write-ups above are the public record; the code stays private for now while the project is early.

New findings land here as I run them. Subscribe via RSS to follow along.