Negative Result Repository: Learning from Failed Experiments at Machine Speed

85% of experiments fail. What if we stopped throwing them away?

Shiro Takagi


§1 The waste problem in autoresearch

Autonomous research pipelines generate experiments at machine speed. Karpathy's autoresearch runs 100+ experiments per night, but roughly 85% are discarded or crash. In my own autoresearch-lite session, 18 out of 21 experiments failed.

Currently, these failures are stored as single rows in a TSV file:

61b4eb4  0.693100  1.1  discard  Increase learning rate from 0.01 to 0.1
1a88964  0.444700  1.1  discard  Switch optimizer from SGD to AdamW
72ce38e  0.000000  0.0  crash    Add residual connections to the CNN

This is the entirety of what the system learns from each failure: a commit hash, a number, a status, and a sentence. The system then proceeds to the next experiment, often proposing changes that are structurally similar to things that already failed.

In my 21-experiment session, the LLM proposed increasing the learning rate four separate times (to 0.1, 0.012, 0.013, and 0.015), all of which failed. It also tried three different architectural expansions (doubling filters, increasing FC size, adding a conv block), all of which failed. The information was there in the results.tsv — but in a format that neither the LLM nor any downstream system could efficiently query.

This is a metascience problem. In traditional science, negative results are systematically underreported, creating publication bias. In autoresearch, negative results are generated at scale but discarded immediately. The failure mode is different — not bias, but waste.

§2 What a negative result actually contains

A failed experiment is not just "it didn't work." It contains:

  • A config diff: what specific parameter changed, from what value to what value
  • A quantitative outcome: how much worse (or how close to baseline)
  • A failure category: regression (significantly worse than the baseline), no improvement (clearly below the baseline), marginal (very close to the baseline), or crash (code or infrastructure error)
  • An implicit constraint: "this region of the search space is unpromising"

When you have 18 such data points, patterns emerge. But only if the data is structured enough to aggregate.
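
To make "structured enough to aggregate" concrete, here is a minimal sketch of what one such record could look like. The field names are illustrative, not the NRR's actual schema:

from dataclasses import dataclass

# Illustrative record for one negative result (not the NRR's actual schema).
@dataclass
class NegativeResult:
    commit: str               # e.g. "61b4eb4"
    config_diff: dict         # e.g. {"LEARNING_RATE": (0.01, 0.1)}
    accuracy_delta: float     # outcome relative to the baseline, e.g. -0.016
    failure_category: str     # "regression" | "no_improvement" | "marginal" | "crash_code" | "crash_infra"
    change_category: str      # "learning_rate", "architecture", "regularization", ...
    lesson: str = ""          # human-readable explanation of what this failure means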

§3 Four capabilities

The Negative Result Repository (NRR) is a prototype with four capabilities, each addressing a different aspect of the waste problem:

3.1 Failure structuring

The parser converts each TSV row into a structured object with:

  • Extracted config diffs (e.g., LEARNING_RATE: 0.01 → 0.1)
  • Failure classification: no_improvement (11), regression (3), marginal (2), crash_code (1), crash_infra (1)
  • Change category: what type of change was attempted (learning_rate, architecture, regularization, etc.)
  • A computed lesson: human-readable explanation of what this failure means

The parser tracks config state across the experiment sequence: after experiment 7 was kept (NUM_EPOCHS: 10→15), subsequent experiments' baselines reflect that change. This matters because a later proposal such as "reduce weight decay from 5e-5 to 2e-5" is relative to the config after all previously kept changes, not the original baseline.
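
A minimal sketch of that state tracking, assuming each parsed row exposes the proposed new values, a keep/discard status, and the measured delta (attribute names are illustrative):

def replay_config(rows, baseline):
    """Track config state across the sequence so each failure's diff is recorded
    against the config it actually ran on, not the original baseline."""
    current = dict(baseline)                      # e.g. {"LEARNING_RATE": 0.01, "NUM_EPOCHS": 10}
    failures = []
    for row in rows:                              # rows in chronological order
        diff = {k: (current.get(k), v) for k, v in row.proposed.items()}
        if row.status == "keep":                  # kept experiments move the baseline forward
            current.update(row.proposed)
        else:                                     # failures are stored against the current baseline
            failures.append((row.commit, diff, row.accuracy_delta))
    return failures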

3.2 Similarity search

Each failure is encoded as a feature vector combining accuracy delta, crash status, change category (one-hot), failure category (one-hot), number of config changes, and magnitude of numeric changes. Cosine similarity finds the most relevant past failures for a proposed new experiment.

Example: querying "increase learning rate to 0.05" returns:

sim=0.976 | [61b4eb4] delta=-0.016 | Increase LR from 0.01 to 0.1
sim=0.821 | [7e11bde] delta=-0.005 | Double the number of filters
sim=0.688 | [437c019] delta=-0.006 | Switch from ReLU to GELU

The top result is exactly right: the most relevant past failure for an LR increase is the previous LR increase attempt. The system surfaces this before the experiment runs, potentially saving 60+ seconds of compute per avoided experiment.
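
A minimal sketch of the encoding and lookup, assuming numpy and records that carry the scalar and categorical features listed above (names are illustrative; for a not-yet-run proposal, the outcome fields are simply unknown):

import numpy as np

CHANGE_CATS = ["learning_rate", "optimizer", "architecture", "regularization", "batch_size", "other"]
FAIL_CATS = ["regression", "no_improvement", "marginal", "crash_code", "crash_infra", "unknown"]

def encode(r):
    """Fixed-length feature vector for one past failure or proposed experiment."""
    one_hot = lambda cats, value: [1.0 if value == c else 0.0 for c in cats]
    return np.array(
        [r.accuracy_delta, float(r.crashed), float(r.num_changes), r.change_magnitude]
        + one_hot(CHANGE_CATS, r.change_category)
        + one_hot(FAIL_CATS, r.failure_category)
    )

def most_similar(query, past_failures, k=3):
    """Rank past failures by cosine similarity to the query experiment."""
    q = encode(query)
    cos = lambda v: float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-12)
    return sorted(past_failures, key=lambda r: cos(encode(r)), reverse=True)[:k]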

3.3 Pattern aggregation

Individual failures are aggregated into rules. The system operates at two levels:

Category-level patterns (what type of change consistently fails):

  • [1.00] Learning rate changes: 5 failures, avg delta -0.013. "UNLIKELY to improve."
  • [0.96] Architecture changes: 4 failures (1 crash). "CAUTION: 25% crash rate."
  • [0.48] Regularization changes: 2 failures, avg delta -0.016.

Direction-level patterns (which direction of change fails):

  • [1.00] "AVOID increasing LEARNING_RATE. All 4 attempts failed."

The direction analysis is particularly useful: it distinguishes "increasing LR always fails" from "any LR change always fails" — which matters because the one LR decrease (to 0.005) was marginal rather than clearly bad.
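
A sketch of the direction-level aggregation, operating on the (commit, diff, delta) failure tuples produced by the state-tracking sketch in 3.1 (the threshold and wording are illustrative):

from collections import defaultdict

def direction_rules(failures, min_count=3):
    """Group failed changes by (parameter, direction); since the input contains
    only failures, a large-enough group becomes an AVOID rule."""
    groups = defaultdict(list)
    for _commit, diff, delta in failures:
        for param, (old, new) in diff.items():
            if isinstance(old, (int, float)) and isinstance(new, (int, float)):
                direction = "increasing" if new > old else "decreasing"
                groups[(param, direction)].append(delta)
    rules = []
    for (param, direction), deltas in sorted(groups.items()):
        if len(deltas) >= min_count:
            avg = sum(deltas) / len(deltas)
            rules.append(f"AVOID {direction} {param}: all {len(deltas)} attempts failed (avg delta {avg:+.3f}).")
    return rules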

3.4 Autoresearch loop integration

The check_proposal() interface is designed to be called before each experiment in an autoresearch loop. It returns a recommendation (proceed/caution/avoid), an estimated success probability, the most similar past failures, and relevant patterns.
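
The prototype's exact return type isn't reproduced here; as a sketch, it could look something like this (names and thresholds are illustrative):

from dataclasses import dataclass

@dataclass
class ProposalCheck:
    recommendation: str            # "proceed" | "caution" | "avoid"
    success_probability: float     # heuristic estimate, not a calibrated probability
    similar_failures: list         # top-k past failures from the similarity search (3.2)
    patterns: list                 # category/direction rules that match the proposal (3.3)

def to_recommendation(prob, contradicts_strong_rule):
    """Map the heuristic probability onto the three-way recommendation."""
    if contradicts_strong_rule or prob < 0.1:
        return "avoid"
    return "caution" if prob < 0.4 else "proceed"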

Applied to five hypothetical proposals against the 21-experiment database:

Proposal                  | Recommendation | Probability | Basis
Increase LR to 0.02       | AVOID          | 0%          | 5/5 LR changes failed; all 4 increases failed
Switch to AdamW           | AVOID          | 0%          | Previous AdamW attempt: -0.265 regression
Add residual connections  | AVOID          | 0%          | Previous attempt crashed; 25% architecture crash rate
Reduce batch size to 64   | PROCEED        | 33%         | No strong signal against (only 1 batch size experiment)
Mixed precision training  | CAUTION        | 27%         | Novel change, but multiple-change category has 50% crash rate

The first three recommendations are correct: the database has strong evidence these directions fail. The fourth is also correct: batch size is nearly unexplored territory, with only a single prior experiment. The fifth is reasonable: mixed precision is novel enough to try, but caution is warranted given limited data.

§4 What I learned building this

The description field is the hardest part. The LLM-generated descriptions in results.tsv are natural language, and extracting structured config diffs from them requires pattern matching that is specific to each autoresearch setup. "Reduce weight decay from 1e-4 to 5e-5" parses cleanly. "Add a fourth convolutional block with 256 filters to increase model depth" requires knowing what "fourth block" means for this architecture. A standard machine-readable output format for autoresearch systems would eliminate this entire parsing layer.
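
For the cases that do parse cleanly, the extraction can be as simple as a single pattern. A sketch (not the prototype's actual rules); mapping the matched phrase onto a concrete config key still needs per-setup knowledge:

import re

# Matches e.g. "Reduce weight decay from 1e-4 to 5e-5" (a sketch, not the NRR's actual regex).
FROM_TO = re.compile(
    r"(?P<param>[A-Za-z_ ]+?)\s+from\s+(?P<old>[0-9.eE+-]+)\s+to\s+(?P<new>[0-9.eE+-]+)"
)

def parse_from_to(description):
    m = FROM_TO.search(description)
    if m is None:
        return None                 # e.g. architectural changes: fall back to other rules or an LLM
    return m.group("param").strip(), float(m.group("old")), float(m.group("new"))

# parse_from_to("Reduce weight decay from 1e-4 to 5e-5") -> ("Reduce weight decay", 0.0001, 5e-05)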

18 failures is enough for patterns, not for statistics. The pattern "all 4 LR increases failed" is convincing to a human, but a frequentist would note that n=4 is not significant. At scale (hundreds of experiments), these patterns would carry real statistical weight. The prototype shows the structure works; scale would make it rigorous.

Negative results form a search space map. The 18 failures collectively define a region of the hyperparameter space that has been explored and found unpromising. This is exactly analogous to how optimization algorithms use past evaluations — except here the "evaluations" come from an LLM's choices rather than a systematic search. The NRR converts unstructured LLM-driven exploration into something closer to a Bayesian optimization history.

The biggest value is preventing repeated failures. In my 21-experiment session, at least 3 experiments were structurally redundant: four LR increases that all failed for the same reason. If the system had checked the NRR after the second failure, it would have saved 2 experiments (~2 minutes of GPU time, 2 API calls). At scale — 100 experiments per night — preventing 10-20% redundant failures would meaningfully reduce cost and accelerate the search toward productive regions.

§5 Connection to the infrastructure stack

This is the fourth prototype in a series exploring autoresearch infrastructure, alongside the EED, the ECV, and the LNRA.

The EED focuses on amplifying successes. The NRR focuses on learning from failures. Together, they cover both sides: propagate what works, avoid repeating what doesn't. The ECV adds a quality gate between them, and LNRA provides the knowledge representation layer.

The NRR fills a specific gap: while the other three tools operate after experiments complete, the NRR operates before the next experiment starts. It is the only tool in the stack that can prevent wasted compute rather than just organizing results after the fact.

§6 Limitations and next steps

Limitations:

  • The parser is specific to autoresearch-lite's description format. A different autoresearch system would need different extraction rules.
  • Similarity search uses a hand-crafted feature vector. Embedding-based search (using the description text) would be more robust.
  • Pattern confidence scores are heuristic, not statistically calibrated.
  • The system has not been tested in a live autoresearch loop (only retroactive analysis).

Next steps:

  • Integration with a live autoresearch-lite loop: inject check_proposal() between the LLM proposal step and the execution step (sketched after this list)
  • LLM-based description parsing as a fallback for the regex parser
  • Cross-session learning: accumulate negative results across multiple autoresearch sessions
  • Standard output format proposal for autoresearch systems
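
A sketch of that injection point, with hypothetical propose_experiment() and run_experiment() helpers standing in for the real proposal and execution steps:

def autoresearch_step(nrr, history):
    """One loop iteration with the NRR gate between proposal and execution."""
    description = propose_experiment(history)     # LLM proposal step (hypothetical helper)
    report = nrr.check_proposal(description)      # NRR gate, before any compute is spent
    if report.recommendation == "avoid":
        history.append(("skipped", description, report))
        return                                    # skip known-bad directions entirely
    result = run_experiment(description)          # execution step (hypothetical helper)
    history.append(("ran", description, result))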

GitHub: t46/negative-result-repository