I Put My AI Research System's Failure Rate on Its Own Front Page

The numbers

393 canonical AI-generated artifacts

393 pass my strict audit

0 currently fail it

I built an agentic research system called Enoch. It queues ideas, gates worker dispatch, supervises GPU runs, syncs evidence, and packages the results as AI-generated research artifacts. The public corpus now contains 393 canonical artifacts: 376 unique topics from the duplicate-cleanup pass plus 17 later finalized corpus imports.

That number changed during the release cleanup. The first public pass had 497 slug directories. A dedupe audit found 119 duplicate clusters and 121 duplicate directories. The cleanup kept the unsuffixed canonical slug, removed the suffixed siblings, and recovered 118 populated claim ledgers from older duplicate siblings before deleting them. The result is smaller and more honest: 376 unique topics, not 497 directory entries.
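The cleanup arithmetic and the grouping step can be sketched in a few lines. This is an illustration, not the actual cleanup script: the suffix convention (canonical slug plus "-<number>"), the function name, and the example slugs are all assumptions.

```python
import re
from collections import defaultdict

# Assumption: a duplicate sibling is the canonical slug plus "-<number>".
SUFFIX = re.compile(r"^(?P<base>.+)-(?P<n>\d+)$")


def group_duplicates(slugs):
    """Map each canonical (unsuffixed) slug to its suffixed siblings."""
    present = set(slugs)
    clusters = defaultdict(list)
    for slug in sorted(slugs):
        match = SUFFIX.match(slug)
        # Only treat a slug as a sibling if the unsuffixed slug really exists.
        if match and match.group("base") in present:
            clusters[match.group("base")].append(slug)
    return dict(clusters)


# Counts taken from the text above.
first_pass_dirs = 497
duplicate_dirs = 121
later_imports = 17

unique_topics = first_pass_dirs - duplicate_dirs  # 376
canonical = unique_topics + later_imports         # 393
```

A slug such as "top-5" is left alone here unless a plain "top" directory also exists, which is the kind of edge case a real dedupe audit would need to review by hand.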

I also built the audit gate that inspects those artifacts. Its job is to refuse anything that does not meet a strict claim-and-evidence contract. The dedupe and evidence-sync passes improved the severe failure modes: empty claim ledgers are gone, public result-file references are accounted for, and the strict gate now passes the corpus. That does not make the papers scientifically correct; it means their generated claims have an inspectable evidence contract.

That audit status is on the front page of the project. Not buried in a quality report. Not hidden behind a generic green checkmark. Headlined next to the project hero image, with links to representative passing artifacts and to the audit report.

People keep asking me why I would do that. The honest answer is that every other AI research release I have seen is implicitly making claims it cannot defend, and I did not want to ship one more of those.

Why the number is there

There is a second number on the same page: 393 of 393 pass the packaging and provenance lint. Every artifact has the required metadata, an AI provenance notice, no placeholder citations, no unsupported claims of human authorship or peer review, and no obvious overclaim patterns. That gate is real and non-trivial.
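A lint of this kind is mostly presence and pattern checks. The sketch below is a minimal illustration under assumptions: the metadata field names, the regular expressions, and the function name are hypothetical, and the patterns are deliberately crude.

```python
import re

# Hypothetical field names and patterns, chosen for illustration only.
REQUIRED_FIELDS = ("title", "slug", "license", "ai_provenance_notice")
PLACEHOLDER_CITATION = re.compile(r"\[(?:citation needed|todo|\?+)\]", re.IGNORECASE)
AUTHORSHIP_OR_REVIEW = re.compile(
    r"\b(?:peer[- ]reviewed|written by the authors)\b", re.IGNORECASE
)


def packaging_lint(metadata: dict, body: str) -> list[str]:
    """Return a list of packaging problems; an empty list means the lint passes."""
    problems = [
        f"missing metadata: {name}" for name in REQUIRED_FIELDS if not metadata.get(name)
    ]
    if PLACEHOLDER_CITATION.search(body):
        problems.append("placeholder citation")
    if AUTHORSHIP_OR_REVIEW.search(body):
        problems.append("unsupported authorship or peer-review claim")
    return problems


meta = {
    "title": "Demo",
    "slug": "demo",
    "license": "CC0-1.0",
    "ai_provenance_notice": "This artifact is AI-generated.",
}

# A well-formed body passes: nothing here checks whether the number is true.
print(packaging_lint(meta, "Throughput increased by 1.6x on workload X."))  # []
print(packaging_lint(meta, "Prior work [citation needed] agrees."))  # ['placeholder citation']
```

Nothing in the sketch opens a result file or compares a number in the body against data.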

It is also, by itself, dangerous.

"393 of 393" is the kind of headline number that can imply correctness if it is not scoped. A reader who sees only the packaging/provenance count would be forgiven for thinking the corpus has been scientifically validated. It has not. The packaging and provenance lint validates that the artifacts are shaped like research. It does not validate that they are right.

If I shipped only the packaging number, I would be telling a narrow truth in a way that produced a broader lie. So I shipped the strict-audit number too. Side by side, the same size, the same visual weight:

393/393 pass packaging and provenance lint.
393/393 pass strict claim and evidence audit.

The two numbers together tell the real story. The first says the corpus is well-formed. The second says every artifact now has the stricter claim/evidence accounting needed for public inspection. Neither number says the papers are peer-reviewed or scientifically correct.

What the strict gate actually checks

The strict audit script is not a rubber stamp. It is the sharpest gate in the release.

For every paper, it requires a non-empty claim ledger. A claim ledger is a structured list of the specific claims the paper makes. Not the abstract, not the conclusion paragraph, the discrete claims. "Throughput increased by 1.6x on workload X." "The model converges in 3000 steps." "P95 latency under load Y is 140ms." If the ledger is empty, the paper is rejected.
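As a data structure, a claim ledger can be very small. The sketch below assumes a shape for illustration; the class names, field names, and the check function are hypothetical, not the schema the audit script uses.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    claim_id: str
    text: str  # one discrete, checkable statement
    evidence: list[str] = field(default_factory=list)  # references into the corpus


@dataclass
class ClaimLedger:
    slug: str
    claims: list[Claim] = field(default_factory=list)


def check_ledger_non_empty(ledger: ClaimLedger) -> list[str]:
    """Reject a paper whose ledger contains no claims."""
    return [] if ledger.claims else [f"{ledger.slug}: empty claim ledger"]


ledger = ClaimLedger(
    slug="demo",
    claims=[Claim("c1", "Throughput increased by 1.6x on workload X.")],
)
```

The point of the structure is that each claim is an addressable record, so later checks can attach to a claim rather than to a paragraph.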

For papers with a non-empty ledger, every claim must link to an evidence reference. Not a citation to an external paper. A pointer to an artifact in the corpus: a log file, a CSV, a benchmark output, a config. Each claim gets a specific line in a specific file that supports it. Without that link, the claim is rejected.
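The per-claim link check could look roughly like the following. It assumes a JSON-like ledger in which each evidence reference carries a "path" and an integer "line"; those key names and the function name are illustrative assumptions.

```python
def unlinked_claims(ledger: list[dict]) -> list[str]:
    """Return the ids of claims that have no usable evidence reference."""
    rejected = []
    for claim in ledger:
        refs = claim.get("evidence", [])
        # A usable reference names a file and a specific line inside it.
        usable = [r for r in refs if r.get("path") and isinstance(r.get("line"), int)]
        if not usable:
            rejected.append(claim["id"])
    return rejected


sample_ledger = [
    {
        "id": "c1",
        "text": "Throughput increased by 1.6x on workload X.",
        "evidence": [{"path": "results/bench.csv", "line": 42}],
    },
    {"id": "c2", "text": "The model converges in 3000 steps.", "evidence": []},
    {"id": "c3", "text": "P95 latency under load Y is 140ms.",
     "evidence": [{"path": "results/latency.log"}]},  # no line given
]

print(unlinked_claims(sample_ledger))  # ['c2', 'c3']
```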

For every evidence reference, the referenced file must either be publicly present in the corpus repository, or explicitly declared unavailable. Declared unavailable is not a free pass. It requires four things: a reason, an SHA-256 hash, a byte count, and a public surrogate. The surrogate is a substitute file that is cryptographically linkable to the original. The script opens the surrogate, re-hashes it, and verifies the hash and byte count match what the paper claims.
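The surrogate check described above reduces to a presence check on four fields plus a re-hash. The sketch uses Python's standard hashlib. The declaration keys and the function name are assumptions, and the surrogate is passed in as bytes so the example stays self-contained; a real script would read the file from the repository.

```python
import hashlib

# Hypothetical key names for the four required items.
REQUIRED_KEYS = ("reason", "sha256", "bytes", "surrogate")


def check_unavailable(declaration: dict, surrogate_data: bytes) -> list[str]:
    """Validate a declared-unavailable entry against its public surrogate."""
    problems = [
        f"missing field: {key}"
        for key in REQUIRED_KEYS
        if declaration.get(key) in (None, "")
    ]
    if problems:
        return problems
    if len(surrogate_data) != declaration["bytes"]:
        problems.append("byte count mismatch")
    if hashlib.sha256(surrogate_data).hexdigest() != declaration["sha256"].lower():
        problems.append("sha256 mismatch")
    return problems


data = b"step,loss\n1,0.5\n"
declaration = {
    "reason": "raw trace too large to publish",
    "sha256": hashlib.sha256(data).hexdigest(),
    "bytes": len(data),
    "surrogate": "papers/demo/surrogates/trace.csv",
}
```

Returning early when a field is missing keeps the error list readable: a declaration without a hash cannot meaningfully fail a hash comparison.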

Absolute private paths are rejected. A reference like /private/runs/bench.csv describes a file on one machine, not a file anyone else can inspect. The script refuses it.
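Rejecting absolute paths is a one-function check with the standard pathlib module. This sketch tests both POSIX and Windows conventions so a reference is caught regardless of the machine that wrote it; the function name is an assumption.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_private_absolute(ref: str) -> bool:
    """True if the reference is an absolute path under POSIX or Windows rules."""
    return PurePosixPath(ref).is_absolute() or PureWindowsPath(ref).is_absolute()


print(is_private_absolute("/private/runs/bench.csv"))  # True
print(is_private_absolute("papers/slug/results/bench-2026-05-04.csv"))  # False
```

The pure path classes never touch the filesystem, so the check behaves the same on any machine that runs the audit.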

That is the gate. It is strict on purpose. It is not asking "did the paper look professional?" It is asking "if someone wanted to check your claims, could they?"

For the current corpus, the script tracks 1,179 result-file references and reports 0 missing public result-file references. That is the current pass condition: every referenced result file is either present in the public repository or explicitly declared unavailable with public surrogate metadata.
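That pass condition amounts to a three-way classification followed by a count. A minimal sketch, assuming references are plain repository-relative strings and that the sets of public files and declared-unavailable entries are already known; the function names and example paths are illustrative.

```python
from collections import Counter


def classify_reference(ref: str, public_files: set, declared_unavailable: set) -> str:
    """Place one result-file reference into exactly one bucket."""
    if ref in public_files:
        return "present"
    if ref in declared_unavailable:
        return "declared_unavailable"
    return "missing"


def tally_references(refs, public_files, declared_unavailable) -> Counter:
    return Counter(
        classify_reference(ref, public_files, declared_unavailable) for ref in refs
    )


def corpus_passes(counts: Counter) -> bool:
    # The gate passes only when no reference falls into the "missing" bucket.
    return counts["missing"] == 0


refs = ["results/a.csv", "results/b.log", "results/c.csv"]
counts = tally_references(refs, {"results/a.csv"}, {"results/b.log"})
print(counts["missing"], corpus_passes(counts))  # 1 False
```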

I could have shipped the release without this number being visible. I chose not to.

What the cleanup taught me

The dedupe pass is the part I almost wish I had done before publishing the first version of this essay, because it made the release more interesting and more uncomfortable.

The newer canonical papers often had the better prose. The older suffixed siblings often had the better audit discipline: populated claim ledgers, even when the prose was shorter. If I had simply deleted the duplicates, I would have kept the cleaner directory layout and thrown away useful evidence structure. So the cleanup did the less glamorous thing: it merged discipline back into the canonical artifact before deleting the duplicate.
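The merge-before-delete step can be sketched as follows. The dict shape, the "ledger_recovered_from" field, and the rule of preferring the sibling with the most claims are all assumptions made for illustration, not a description of the actual cleanup.

```python
def recover_ledger(canonical: dict, siblings: list[dict]) -> dict:
    """Return the canonical artifact, borrowing a sibling's ledger if its own is empty."""
    if canonical.get("claims"):
        return canonical  # the canonical ledger is already populated
    donors = [s for s in siblings if s.get("claims")]
    if not donors:
        return canonical  # nothing to recover
    # Assumption: prefer the sibling with the most claims.
    best = max(donors, key=lambda s: len(s["claims"]))
    return {
        **canonical,
        "claims": list(best["claims"]),
        "ledger_recovered_from": best["slug"],
    }


canonical = {"slug": "vllm-stress", "claims": []}
siblings = [
    {"slug": "vllm-stress-2", "claims": ["c1", "c2"]},
    {"slug": "vllm-stress-3", "claims": []},
]
merged = recover_ledger(canonical, siblings)
print(merged["ledger_recovered_from"])  # vllm-stress-2
```

Recording which sibling the ledger came from keeps the recovery itself auditable after the duplicate directory is gone.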

That is the actual lesson. AI research release hygiene is not just writing better papers. It is preserving the boring connective tissue — claim ledgers, manifests, result-file references, provenance fields — when the pipeline changes shape. The prose can improve while the evidence model regresses. Unless a validator catches that, the release looks better while becoming less auditable.

After cleanup, the denominator fell from 497 to 376 and 118 claim ledgers were rescued; 17 later finalized corpus imports moved the live denominator to 393. The later evidence-sync work moved the strict pass count to 393. That is the key fact: duplicates were one problem, but missing public evidence was the real release blocker.

Representative passing examples

The corpus now clears the gate. The landing page still links to representative artifacts so readers can inspect what a passing audit bundle looks like, including the surrogate-hash path for declared-unavailable files.

The examples include substantive technical reports, not only a synthetic demo: a vLLM continuous-serving stress campaign, a retrieval-augmented SSM architecture writeup, and a worked strict-audit bundle.

The goal was never a small showcase. The goal is every featured paper in the corpus clearing the same public evidence contract, with private workstation paths replaced by public files or public surrogate metadata.

That full-corpus pass is on the front page today. Not because it proves scientific correctness, but because readers should see exactly which gate passed and what it does not validate.

What most AI research releases do instead

This is the uncomfortable part.

Pick up the release of any recent AI research artifact. An auto-research agent, a benchmark suite, a foundation-model technical report, a "here is what our system discovered" blog post. Almost all of them do one or more of the following:

Cite results that are not accompanied by the raw logs, CSVs, or traces a reader would need to verify them.
Reference "internal benchmarks" or "our evaluation harness" without publishing the harness or the raw output.
Use screenshots of numbers instead of the structured data that produced the numbers.
Make claims in prose that are not tied back to a specific line in a specific artifact.
Publish the output but not the evidence trail that links the output to reality.

These practices are not unique to AI research. Science has always had reproducibility problems. But AI research has a unique property that amplifies the problem: the outputs themselves are generated by a system that is known to hallucinate confidently and cite sources that do not exist. A hallucination-prone system producing prose that cannot be traced back to inspectable artifacts is not research. It is plausible-looking text.

If you ran my strict audit script against most published AI research corpora, I would expect a pass rate in the single-digit percentages, far below where this corpus sits now.

The packaging lint would pass. The strict audit would not. Almost nobody is running the strict audit.

I am not claiming my corpus is better than those releases. My corpus is mostly AI-generated prose that has not been peer-reviewed, replicated, or corrected by a human author. It is released as AI-generated output, explicitly, with no authorship claim and a CC0 license. On that axis, I have no high ground.

The thing I am claiming is that I built the gate that surfaces the evidence contract and I put that gate on the front page. That is a different claim. It is a claim about how to release AI-generated research output, not a claim about the quality of any specific paper.

The gate is the product. The papers are what the gate audits.

Why I think this matters for the field

Agentic AI systems are going to produce a lot of research output. Some of it will be valuable. Most of it will not be. That ratio is not avoidable in the short term — it is a property of the systems we have, not a failure of discipline.

What is avoidable is the practice of releasing the output without the infrastructure to tell the difference.

If an AI system produces 500 research reports and a reader cannot distinguish which ones have verifiable evidence from which ones are plausible-sounding prose, the whole output is untrustworthy. Not because the valuable 5% is wrong — it probably is not — but because the untrustworthy 95% contaminates the signal. A reader who encounters one hallucinated citation will rightly mistrust the next hundred, even if some of those hundred are fine.

The fix is not "stop publishing AI-generated research output." Agents are going to generate this material whether we release it or not. The fix is to release it with infrastructure that separates the verifiable from the unverified, so a reader can see which is which. A packaging lint is not enough. A strict claim-and-evidence audit, with public pass rates, is the minimum bar.

Put differently: the interesting artifact in an AI research release is not the paper. It is the audit trail around the paper. A paper without an audit trail is an assertion. A paper with an audit trail is a claim with receipts. An AI research system that ships papers without building the audit infrastructure is shipping assertions at scale.

What you should do if you are releasing AI research output

If you build a system that generates research artifacts and you intend to release those artifacts publicly, some suggestions from someone who just did this and is still figuring it out:

Build the claim ledger before you build the paper writer. The claim ledger is harder to bolt on than it looks. If the paper writer generates prose freely and then you try to extract claims from the prose afterward, you will miss things and you will miss them in ways that correlate with whatever the writer's biases are. Build the ledger as a first-class output of the pipeline.

Every claim should link to an artifact, not a cited source. A reference to "the benchmark results" is not evidence. A reference to papers/slug/results/bench-2026-05-04.csv, line 42, column 3, is evidence. The granularity matters. If the granularity is not at the artifact level, a reader cannot check.
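To show why that granularity matters, here is a minimal sketch of resolving a file, line, and column reference to the value it points at, using the standard csv module. The function name and the sample data are made up for illustration, and the CSV is passed as text so the example runs without a repository.

```python
import csv
import io


def resolve_cell(csv_text: str, line: int, column: int) -> str:
    """Return the cell at a 1-indexed line and column of a CSV document."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[line - 1][column - 1]


sample = "workload,baseline_rps,speedup\nX,100,1.6\n"

# The claim "Throughput increased by 1.6x on workload X" points at line 2, column 3.
print(resolve_cell(sample, 2, 3))  # 1.6
```

One caveat: csv rows and physical file lines diverge when a quoted field contains a newline, so a real reference format should say which of the two it counts.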

Ship the artifacts in the same release as the prose. Do not separate them. The moment your paper references a file that is not in the repository, that claim is unverifiable. This is obvious in principle. It is violated constantly in practice because committing the actual result files is inconvenient and the prose looks fine without them.

Publish the audit status. Build an audit script that is stricter than your publication pipeline, run it against your corpus, and put the pass rate on the front page. Whatever number you get, put it up. If the number is small, the number is small. If it improves, keep showing the gate instead of collapsing it into a vague green checkmark.
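Producing the front-page lines from raw gate results takes very little code. This sketch assumes each gate yields one boolean per artifact; the function name and input shape are illustrative.

```python
def render_status(results: dict[str, list[bool]]) -> list[str]:
    """Render one 'passed/total' line per gate, with the same format for each."""
    return [
        f"{sum(outcomes)}/{len(outcomes)} pass {gate}."
        for gate, outcomes in results.items()
    ]


status = render_status(
    {
        "packaging and provenance lint": [True] * 393,
        "strict claim and evidence audit": [True] * 393,
    }
)
for line in status:
    print(line)
```

Because the lines are computed from the per-artifact outcomes, a regression in either gate changes the published number on the next build.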

Make the gate link to passing examples. Passing artifacts should be findable in one click from the pass rate. Readers need to be able to see what good looks like, not just the ratio.

Do not confuse packaging discipline with epistemic discipline. A well-formed paper with hallucinated citations is worse than a malformed paper with real ones. Packaging gates are easy. Epistemic gates are hard. Treat them differently.

What I am actually building

Enoch is not primarily a paper-writing system. The paper-writing is a consumer of a pipeline whose real job is state safety for long-running AI work: queue discipline, worker preflight, single-lane GPU dispatch, process-tree tracking, and evidence sync. The papers are what fall out when you run that pipeline on research tasks.

But the audit gate is the part I think is generalizable. You do not need Enoch to use it. The contract — claim ledger, artifact-linked evidence, hash-verified surrogates for declared-unavailable files, absolute-path rejection, public pass rate on the front page — is not specific to my stack. Any AI research release could adopt it. Most could adopt the first 80% in a weekend.

If enough releases did, the ratio of verifiable-to-unverifiable AI research output would shift. Not by producing more correct output. By making it easier for readers to see which output is which.

The current corpus passes the strict gate. I am not treating that as proof that the papers are correct. I am proud of the gate because it forces the release to carry an explicit evidence contract.

The number stays on the front page. That is the deal.