
New journals appear all the time, and few are worth paying attention to. When the Journal of Open Source Software (JOSS) launched in early May, a few things made it stand out.

First, it's specifically for scientific code, an area that's been problematic for many journals for a long time. Second, it's an overlay journal, in this case piggybacking on GitHub, which is a (fairly) new and interesting development for open scientific publishing. And finally, it suggests that

...we expect that it should take less than an hour to prepare and submit your paper to JOSS.

That sounded like a challenge worth trying.

The paper

So, less than a week after the journal launched, we (Alice Harpole and Ian Hawke) decided to give it a try. Now that our paper has been accepted we can confirm that the total time writing the paper was under an hour: roughly half an hour for the original paper, and twenty minutes for the revision.

What's the trick? Well, it's like running a marathon in under three hours: once you've run the first 26 miles in under two, doing the final 385 yards in under an hour isn't a problem.

The point here, that JOSS makes explicit, is that a "code paper" is, in large part, advertising the existence, documentation, and tests of the code. These are essential things that shouldn't need an academic journal paper: they should be part of good scientific practice. Running the first 26 miles is getting the code and all the essential best practice parts in place: the final 385 yards is a brief paper advertising the steps you took.

The steps in brief

So, what are the steps that we took that made writing the paper so simple?

  1. Have an idea for a scientific code.
  2. Make the code publicly available.
  3. Give the code an open license.
  4. Write tests for the code, run through continuous integration, that cover all key parts.
  5. Document the code, with plenty of examples, publicly hosted, being clear about what the code can do, and how fast.
  6. Make sure the code is easy to install.
  7. Release the code through an archiving service.

The steps at length

The idea

This is sometimes the hard part, as most scientists can't get their ideas from the little old lady from Schenectady. In our case, we needed a test case for a different project we were working on. The code rapidly spiralled from a short script to something rather more complex, and became a test case for much of what follows.

Making it public

Cloud-based version control repositories are currently the easiest way to do this. We used GitHub pretty much from the start; with GitHub now offering private repositories to academics, it's straightforward to start private and flip public for release.

Give the code an open license

When you create a new repository, GitHub asks what license you want to give it. In our case we initially forgot (a bad mistake), but updated to the MIT license shortly after submission. Note that JOSS specifies the license of the papers themselves, but you must choose the license for your code yourself, in advance.

Write tests

Our code is written in Python, so we could use a standard, simple unit testing framework. We chose nose (as one of us had used it before). The tests ensured that the results matched existing codes in simple cases that could be directly compared, and could be used as regression tests in more complex cases.
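
As an illustration of the kind of test involved (a sketch, not one of the project's actual tests; the package and function names are hypothetical), a nose-style test comparing the code against a known exact solution looks like this:

    import numpy as np
    from numpy.testing import assert_allclose

    # Hypothetical import - substitute your own package
    from mypackage import evolve

    def test_constant_state():
        # Evolving a constant state should leave it unchanged,
        # up to floating point error
        initial = np.ones(100)
        final = evolve(initial, t_end=1.0)
        assert_allclose(final, initial, rtol=1e-12)

nose will automatically discover and run any function whose name starts with test_.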

With a modern testing framework like this, it's very useful to tie the testing into the GitHub version control using continuous integration. For this we used Travis, which directly integrates with GitHub. This has the large advantage of rapidly testing the code on many versions of Python, for every commit or pull request. It had the downside of needing some fiddling to make it work, as the history of our .travis.yml file shows.
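
For orientation, a minimal .travis.yml along these lines (a sketch, not the project's actual file) is enough to run the test suite against several Python versions on every commit:

    language: python
    python:
      - "2.7"
      - "3.4"
      - "3.5"
    install:
      - pip install -r requirements.txt
      - pip install nose
    script:
      - nosetests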

To check how much of the code we were actually testing, we could link GitHub and Travis to a code coverage checker - in our case codecov.io. The key was working out what we shouldn't worry about testing - for us, the plotting code was inessential - so that the effort went into getting the science code right. The exclusions are managed through the .coveragerc file.
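
A minimal .coveragerc excluding such files might look like the following (a sketch; the plotting module name is hypothetical):

    [run]
    omit =
        */plotting.py
        */tests/*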

Document the code

README

The top level README needs to cover a lot of ground. Most important are what the code is meant to do, how to get and install it, how to use it, and how to get support. Installation we'll come back to later. In our case, support and contributions are all dealt with through GitHub issues. The purpose of the code should be the focus, particularly as this brief, punchy introduction should closely mirror the content of the paper for JOSS!
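
A skeleton covering these essentials might look like the following (all names and commands hypothetical):

    # mypackage

    A one-paragraph description of what the code does, and why.

    ## Installation

        pip install mypackage

    ## Usage

    A short, worked example using the code directly.

    ## Support and contributing

    Please report problems, and propose changes, through GitHub issues.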

Examples

When writing the code we were (and still are) planning to use it to produce tests for another project and paper. This required a broad range of examples, most of which were best illustrated by using the code directly. As it's a Python code, it had been specifically designed to work within Jupyter notebooks, so that modifications to examples could be displayed rapidly. The documentation was originally written directly into Jupyter notebooks, with the intention of converting the text and figures directly into a later paper.

This had a specific advantage when producing documentation for the code: Jupyter notebooks are now so popular that there are many tools that convert them directly into useful online forms. The first, central tool is sphinx, which is used to generate documentation for many projects, particularly those based on Python. Once sphinx was set up, we could then use nbsphinx, which converts Jupyter notebooks directly into sphinx documentation. This makes creating documentation with up-to-date examples simple: write the code and accompanying text into a Jupyter notebook, and nbsphinx will run it automatically to produce the documentation.
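
In sphinx terms this amounts to one extra extension in conf.py (a fragment, assuming a standard sphinx-quickstart setup):

    # conf.py (fragment)
    extensions = [
        'nbsphinx',            # render Jupyter notebooks as documentation pages
        'sphinx.ext.mathjax',  # render mathematics within the notebooks
    ]

    # Don't pick up stray notebook checkpoints as documentation sources
    exclude_patterns = ['_build', '**.ipynb_checkpoints']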

To host the documentation cleanly online there is another key integration with GitHub: Read the Docs. If you have sphinx documentation in a GitHub project then Read the Docs will host it, giving automatically generated, clean HTML documentation, with the option of PDF and epub versions if required. Just to note: the documentation on Read the Docs was something the reviewer particularly liked.

However, there are a few pain points to note here. A minor one is the inclusion of mathematics in the documentation: making it show up cleanly in the epub version requires the pngmath extension (see eg this commit for this set of documentation), which doesn't display as nicely online. A more important issue is that getting the documentation to build on Read the Docs can be very painful: we still do not have a robust, reliable build process, as it often "randomly" fails.

Code API

Finally, remember to document the internals of the code. This was another place we slipped up initially, but it is particularly important if people are to contribute to and extend your code. Sticking to a convention is useful, as it usually means IDEs can find and nicely format your internal documentation. We chose numpydoc, following standard examples, as we've used it in the past.
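
As a sketch of the convention (the function itself is hypothetical), a numpydoc docstring looks like:

    def evolve(initial, t_end):
        """Evolve the initial state to the given end time.

        Parameters
        ----------
        initial : ndarray
            The initial state vector.
        t_end : float
            The time to evolve to.

        Returns
        -------
        ndarray
            The state vector at time t_end.
        """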

Installation

There is nothing as frustrating as finding a code that might solve your problem, but having to spend hours or days trying to install it just to test it out. Plain Python code is usually fairly simple to try, but installation can be made nearly seamless by using PyPI. There are many tutorials on how to upload your package to PyPI, and it's definitely worth the time to experiment with this to ease the installation pain for your users, and for yourself.
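
The starting point is a setup.py in the repository root; a minimal sketch (all names and dependencies hypothetical) is:

    from setuptools import setup

    setup(
        name='mypackage',
        version='1.0.0',
        description='A one-line description of the code',
        url='https://github.com/username/mypackage',
        license='MIT',
        packages=['mypackage'],
        install_requires=['numpy', 'matplotlib'],
    )

With that in place, python setup.py sdist builds a source distribution, and twine upload dist/* pushes it to PyPI (assuming you've registered an account there), after which users can simply pip install mypackage.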

Once you've decided on an installation route, make sure to document this clearly in the top level README and the start of the documentation. It's important to get the dependencies in there!

Release the code

Code changes, both to fix bugs and to keep up with changes in its dependencies. However, a scientific code must be reproducible, so the original version associated with the paper must be stored. With GitHub this can be done via an integration with Zenodo, by creating a release. In addition to giving you a fixed, archived version of the code, this gives you a DOI to refer to it by.

Summary

There's a lot going on in scientific publishing right now, and a lot of questions about how the papers and journals of the future should look. An important question is whether the requirements and incentives of producing papers for "standard" journals are aligned with modern scientific best practice. This is an area where JOSS stands out: it has explicitly set out to reward good scientific coding practice with the key incentive - a publication. It will be interesting to see how this evolves, and whether journals in other areas can follow suit.

