Reproducibility is a furniture problem
We talk about reproducibility like it’s a CI problem — get the same answer twice, with bit-for-bit determinism. That framing skips a much more common failure mode: a paper is hard to reproduce not because the answer is unstable, but because the furniture around the answer is missing.
Furniture, not floors
A model checkpoint without the tokenizer it was trained against is a floor without a chair. A notebook without its requirements.txt is a kitchen without cabinets. You can live in it, but only by treating every sit-down as an archaeology project.
The instinct of the field has been to demand more standardization: pin everything, freeze everything, ship a Docker image. That works, sometimes. More often it produces a 12 GB tarball that nobody opens.
What I’ve found actually helps
Three small habits that, when I see them in a repo, raise the reproducibility ceiling more than any single tool:
- A seeds.md that lists every random seed used and what it controls. Not in code, but in a file you read with your eyes.
- A data.md that explains what shape the data should be in, with a tiny example included as plain text.
- One worked example that runs in under 90 seconds on a laptop. Doesn’t have to match the headline result. Just has to run.
Those three files take half an afternoon to write. They’ve never failed me. The Docker image has — twice last year.
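For the seeds file, one way to keep the human-readable copy honest is to have the code read its seeds from it instead of hard-coding them. The sketch below is only an illustration: the table layout, the seed names, and the helpers parse_seeds and rng_for are all made up for this example, and your seeds.md can look however you like.

```python
import random
import re

# Hypothetical contents of seeds.md. The table layout is an assumption,
# not a standard; in a real repo you would read this text from the file.
SEEDS_MD = """\
# Seeds

| name        | seed | controls                        |
|-------------|------|---------------------------------|
| data_split  | 1234 | train/validation shuffle        |
| model_init  | 42   | weight initialisation           |
| sampling    | 7    | dropout and generation sampling |
"""

# Matches table rows whose second column is an integer, which skips
# the header row and the dashed separator row.
ROW = re.compile(r"^\|\s*(\w+)\s*\|\s*(\d+)\s*\|\s*(.+?)\s*\|\s*$")


def parse_seeds(text):
    """Return {name: (seed, description)} for every row with an integer seed."""
    seeds = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, seed, controls = match.groups()
            seeds[name] = (int(seed), controls)
    return seeds


def rng_for(seeds, name):
    """Build an independent random.Random for one documented purpose."""
    seed, _description = seeds[name]
    return random.Random(seed)


seeds = parse_seeds(SEEDS_MD)

# Each purpose gets its own generator, so changing one seed in the
# file does not silently shift the others.
rng = rng_for(seeds, "data_split")
indices = list(range(10))
rng.shuffle(indices)
```

The point of the design is that the file stays the source of truth: if a seed is not written down where a person can read it, the code has nothing to run with.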
The deeper bet
Software is not the problem here. It’s a paper-writing problem. We write methodology sections as proofs (“we did this, therefore the result holds”) when they should be IKEA instructions (“here is each step, in order, with a picture, and a number to call if the screw is missing”).
Until methodology sections read like that: write the seeds file.