Reproducibility is a furniture problem

2026-05-12 · 8 min read · researchinfra

We talk about reproducibility like it’s a CI problem — get the same answer twice, with bit-for-bit determinism. That framing skips a much more common failure mode: a paper is hard to reproduce not because the answer is unstable, but because the furniture around the answer is missing.

Furniture, not floors

A model checkpoint without the tokenizer it was trained against is a floor without a chair. A notebook without its requirements.txt is a kitchen without cabinets. You can live in it, but only by treating every sit-down as an archaeology project.

The instinct of the field has been to demand more standardization: pin everything, freeze everything, ship a Docker image. That works, sometimes. More often it produces a 12 GB tarball that nobody opens.

What I’ve found actually helps

Three small habits that, when I see them in a repo, raise the reproducibility ceiling more than any single tool:

  1. A seeds.md that lists every random seed used and what it controls. Not in code — in a file you read with your eyes.
  2. A data.md that explains what shape the data should be in, with a tiny example included as plain text.
  3. One worked example that runs in under 90 seconds on a laptop. Doesn’t have to match the headline result. Just has to run.

Those three take half an afternoon to put together. They’ve never failed me. The Docker image has, twice last year.
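For what it’s worth, here is a minimal sketch of what the third habit could look like in Python. Everything in it is made up for illustration: the `SEEDS` entry stands in for whatever seeds.md lists, the `SAMPLE_CSV` columns stand in for whatever data.md describes, and `run_example` is a hypothetical placeholder for a real pipeline step. It reproduces no result. It only shows a seeded run that checks itself and fits well inside the 90-second budget.

```python
import csv
import io
import random
import time

# Hypothetical seed; in a real repo this would mirror an entry in seeds.md.
SEEDS = {"shuffle": 1234}  # controls the row order before the split

# Tiny plain-text sample in the shape data.md would describe (made-up columns).
SAMPLE_CSV = """id,text,label
1,good chair,1
2,wobbly table,0
3,solid shelf,1
4,missing screw,0
"""

BUDGET_SECONDS = 90  # the laptop budget for the worked example


def load_rows(text):
    """Parse the sample and check it has the expected columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows or set(rows[0]) != {"id", "text", "label"}:
        raise ValueError("sample does not match the documented shape")
    return rows


def run_example():
    """Shuffle with a dedicated seeded generator, then split 3/1."""
    rows = load_rows(SAMPLE_CSV)
    rng = random.Random(SEEDS["shuffle"])  # local RNG, no global state
    rng.shuffle(rows)
    train, held_out = rows[:3], rows[3:]
    return [r["id"] for r in train], [r["id"] for r in held_out]


if __name__ == "__main__":
    start = time.perf_counter()
    first = run_example()
    second = run_example()
    elapsed = time.perf_counter() - start
    # Same seed, same split: the whole point of writing the seed down.
    assert first == second, "seeded run was not repeatable"
    assert elapsed < BUDGET_SECONDS, "worked example exceeded its budget"
    print(first, f"{elapsed:.3f}s")
```

The one design choice worth copying is the local `random.Random` instance: each seed gets its own generator, so the line in seeds.md that says what a seed controls stays true even when other code draws random numbers.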

The deeper bet

Software is not the problem here. It’s a paper-writing problem. We write methodology sections as proofs (“we did this, therefore the result holds”) when they should be IKEA instructions (“here is each step, in order, with a picture, and a number to call if the screw is missing”).

Until methodology sections read like that: write the seeds file.