Reproducibility is a furniture problem

2026-05-12 · 8 min read · researchinfra

We talk about reproducibility as if it were a CI problem: get the same answer twice, bit for bit. That framing skips a much more common failure mode: a paper is hard to reproduce not because the answer is unstable, but because the furniture around the answer is missing.

Furniture, not floors

A model checkpoint without the tokenizer it was trained against is a floor without a chair. A notebook without its requirements.txt is a kitchen without cabinets. You can live in it, but only by treating every sit-down as an archaeology project.

The instinct of the field has been to demand more standardization: pin everything, freeze everything, ship a Docker image. That works, sometimes. More often it produces a 12 GB tarball that nobody opens.

What I’ve found actually helps

Three small habits that, when I see them in a repo, raise the reproducibility ceiling more than any single tool:

A seeds.md that lists every random seed used and what it controls. Not in code — in a file you read with your eyes.
A data.md that explains what shape the data should be in, with a tiny example included as plain text.
One worked example that runs in under 90 seconds on a laptop. Doesn’t have to match the headline result. Just has to run.
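
To make that concrete, here is a rough sketch of what the worked example might look like when it reads its seeds from the seeds file. Everything specific in it is my assumption for illustration: the "name: value  # what it controls" layout, the seed names data_shuffle and weight_init, and the toy computation standing in for a real experiment. It uses only the Python standard library.

```python
import random
import time

# Hypothetical seeds.md contents. The "name: value  # what it controls"
# layout is an assumption for this sketch, not a standard.
SEEDS_MD = """\
data_shuffle: 1234   # order of training examples
weight_init: 42      # initial model weights
"""

BUDGET_SECONDS = 90  # the laptop budget for the worked example


def parse_seeds(text):
    """Read 'name: value' lines, ignoring '#' comments and blank lines."""
    seeds = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, value = line.split(":", 1)
        seeds[name.strip()] = int(value)
    return seeds


def run_example(seeds):
    """Toy stand-in for a real experiment: shuffle data, then draw weights."""
    shuffle_rng = random.Random(seeds["data_shuffle"])
    init_rng = random.Random(seeds["weight_init"])
    data = list(range(10))
    shuffle_rng.shuffle(data)
    weights = [round(init_rng.uniform(-1, 1), 4) for _ in range(3)]
    return data, weights


def main():
    start = time.perf_counter()
    seeds = parse_seeds(SEEDS_MD)
    first = run_example(seeds)
    second = run_example(seeds)
    elapsed = time.perf_counter() - start
    # Same seeds should give the same run, and the run should fit the budget.
    assert first == second, "run is not repeatable with the listed seeds"
    assert elapsed < BUDGET_SECONDS, "worked example blew its time budget"
    return first, elapsed


if __name__ == "__main__":
    result, elapsed = main()
    print(f"ran in {elapsed:.3f}s: {result}")
```

Each seed gets its own random.Random instance rather than one global seed, so the file can say what each number controls and changing one does not silently change the other.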

Those three files take half an afternoon to write. They’ve never failed me. The Docker image has — twice last year.

The deeper bet

Software is not the problem here. It’s a paper-writing problem. We write methodology sections as proofs (“we did this, therefore the result holds”) when they should be IKEA instructions (“here is each step, in order, with a picture, and a number to call if the screw is missing”).

Until methodology sections read like that: write the seeds file.