This post was originally pubished on the forums here

The Challenge

Way back in October 2020 the Papers With Code ML Reproducibility Challenge 2020 was launched and shared in the forums. A few of us jumped at the chance to test our ML knowledge and push our skills. Fast forward 110 days since that initial post and we delivered our Reformer Reproducibility submission via OpenReview! Here are a few reflections on our experience; what we enjoyed, tools we used and what we would have done differently:


  • Working as a team pushes your motivation, your skills and your throughput
  • nbdev for development, Weights & Biases for tracking and Discord for communication
  • We could have better used task/project management tools more, maybe we needed a different tool
  • Next time we’ll start experiments sooner and maybe pick a more practical paper
  • It was a massive learning experience and a lot of fun

Why participate

Implementing code from scratch is much more enjoyable and meaningful when there is a direct application, e.g. working towards this reproducibility challenge. Spending weeks and months focussed on a single paper forces you to understand the paper down to the last full stop. It also gives you a great appreciation of how difficult writing a good paper is, you see almost every word and sentence is chosen carefully to communicate a particular concept, problem or model setting.

N heads are better than one a.k.a. Multihead Attention

Our team was distributed across 6 countries and everyone had a somewhat different background, set of skills and personality. This mix was definitely beneficial for getting things done much more smoothly. Having 2 x N eyes researching implementation information or reviewing code really improved coverage and sped up the entire process. It also makes debugging much faster!

Writing code that the entire team will use also meant writing cleaner code with more tests so that it was as clear as possible for your teammates. And finally, during a long project like this it’s easy to get distracted or lazy, however seeing everyone else delivering great work quickly pulls you back into line!

Good tools are key


The nbdev literate programming environment from was super convenient to minimise the project’s development friction. Writing tests as we developed meant that we caught multiple bugs early and auto-generation of docs lends itself immensely to the reproducibility of your code. Most of us will be using this again for our next projects.

Weights & Biases

Weights & Biases generously gave us a team account which enabled us all to log our experiments to a single project. Being directly able to link your runs and results to the final report was really nice. Also it's pretty exciting monitoring 10+ experiments live!


A Discord server worked really well for all our chat and voice communication. Frequent calls to catchup and agree on next steps were super useful. Todo lists and core pieces of code often ended up as pinned messages for quick reference and linking Github activity to a channel was useful for keeping an eye on new commits to the repo.


When it came to writing the final report in latex, Overleaf was a wonderful tool for collaborative editing.


The ReviewNB app on GitHub was very useful for visualizing diffs in notebooks.

Learn from the best

The Reformer architecture had several complex parts, and having Phil Wang's and HuggingFace's Github code was very helpful to understand design decisions and fix issues.

Things we can improve for the next time

Start experiments early

We started our experiments quite late in the project; as we aimed to reimplement Reformer in Pytorch (with reference to existing implementations) about ~90% of our time was spent on ensuring our implementation was faithful to the paper and that it was working correctly. In retrospect starting experiments earlier would have allowed more in depth exploration of what we observed while testing. Full scale experiments have a way of inducing problems you didn’t foresee during the implementation phase...

Task distribution and coordination

When working in a distributed and decentralized team, efficient task allocation and tracking is important. Early in the project todo lists lived in people’s heads, or were quickly buried under 50 chat messages. This was suboptimal for a number of reasons, including that it made involving new people in the project more challenging as they could not easily identify where they could best contribute.

We made a switch to Trello to better track open tasks. It worked reasonably well however its effectiveness was probably proportional to how much time a couple of team members had to review the kanban board, advocate for its use and focus the team’s attention there. The extra friction associated with needing to use another tool unconnected to Github or Discord was probably the reason for why we didn’t use it as much as we could have. Integrating Trello into our workflow or giving Github Projects a trial could have been useful.

More feedback

We had originally intended to get feedback from the fastai community during the project. In the end we were too late in sharing our material, so there wasn’t time for much feedback. Early feedback would have been very useful and the project might have benefited from some periodic summary of accomplishments and current problems. We could have solicited additional feedback from the authors too.

Distributed training

This was our first exposure to distributed training and unfortunately we had a lot of issues with it. We were also unable to log the results from distributed runs properly to Weights & Biases. This slowed down our experiment iteration speed and is why we could not train our models for as long as we would have preferred.

Choice of paper to reproduce

It would have been useful to calculate a rough estimate of the compute budget the paper’s experiments required before jumping into it. In the latter stages of the project we realised that we would be unable to fully replicate some of the paper’s experiments, but instead had to run scaled down versions. In addition, where your interest sits between theoretical and practical papers should be considered when selecting a paper for the challenge.

More tools

We could have tried even more handy tools such as knockknock to alert us when models are finished training and Github Projects for task management.

Some final thoughts

We came out of this project even more motivated compared to how we entered; a great indication that it was both enjoyable and useful for us! Our advice would be to not hesitate to join events like this one and challenge yourself, and try and find one or more other folks in the forums or Discord to work with. After successfully delivering our submission to the challenge we are all eager to work together again on our next project, stay tuned for more!

Thanks for Reading This Far 🙏

As always, I would love to hear your feedback, what could have been written better or clearer, you can find me on twitter: @mcgenergy