Reproducible script execution with DataLad
This tutorial was created by Sin Kim.
GitHub: @kimsin98
Twitter: @SinKim98
Getting Set Up with Neurodesk
For more information on getting set up with a Neurodesk environment, see here.
In addition to being a convenient method of sharing data, DataLad can also help you create reproducible analyses by recording how certain result files were produced (i.e. provenance). This helps others (and you!) easily keep track of analyses and rerun them.
This tutorial will assume you know the basics of navigating the terminal. If you are not familiar with the terminal at all, check the DataLad Handbook’s brief guide.
Create a DataLad project
A DataLad dataset can be any collection of files in folders, so it could be many things including an analysis project. Let’s go to the Neurodesktop storage and create a dataset for some project. Open a terminal and enter these commands:
yoda?
The -c yoda option configures the dataset according to YODA, a set of intuitive organizational principles for data analyses that works especially well with version control.
Go into the dataset and check its contents.
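As a minimal sketch of the two steps above, assuming the Neurodesktop storage is mounted at /neurodesktop-storage and using my-analysis as a hypothetical dataset name, the commands might look like this. The guard simply skips the DataLad calls on a machine where DataLad is not installed.

```shell
# Assumption: Neurodesktop storage is mounted at /neurodesktop-storage.
# If it is not, stay in the current directory instead.
cd /neurodesktop-storage 2>/dev/null || echo "storage not found; using $PWD"

if command -v datalad >/dev/null 2>&1; then
    # Create a dataset named my-analysis (hypothetical name) with the YODA layout.
    datalad create -c yoda my-analysis
    # Go into the dataset and list its contents, including hidden files.
    cd my-analysis
    ls -a
else
    echo "datalad not found on PATH"
fi
```

The listing should include a code/ directory, which is where the script in the next step goes.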
Create a script
One of DataLad’s strengths is that it assumes very little about your datasets. Thus, it can work with any other software on the terminal: Python, R, MATLAB, AFNI, FSL, FreeSurfer, etc. For this tutorial, we will run the simplest Julia script.
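As a minimal sketch of such a script, the commands below write a hypothetical code/hello.jl using a Bash here document. The file contents are an illustrative assumption, not necessarily the script used when this tutorial was written; it only writes a greeting to outputs/hello.txt.

```shell
# Run from the dataset root. In a YODA dataset code/ already exists,
# so mkdir -p does nothing there.
mkdir -p code

# Everything between "<< EOF" and the closing "EOF" line is written to code/hello.jl.
cat << EOF > code/hello.jl
# Hypothetical example script: write a greeting to outputs/hello.txt
mkpath("outputs")
open("outputs/hello.txt", "w") do io
    println(io, "hello from Julia")
end
EOF

# Show the file that was just created.
cat code/hello.jl
```

Because the EOF delimiter is unquoted, Bash would expand any $ or backticks inside the here document; this script contains neither, so it is written out unchanged.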
EOF?
For the sake of demonstration, we create the script using built-in Bash terminal commands only (a here document that starts after << EOF and ends when you enter EOF), but you may use whatever text editor you are most comfortable with to create the code/hello.jl file.
You may want to test (parts of) your script.
Run and record
Before you run your analyses, you should check the dataset for changes and save or clean them.
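A minimal sketch of checking and cleaning, shown in a throwaway Git repository so that nothing real is touched. It uses plain Git commands with hypothetical file names; inside a DataLad dataset, datalad status and datalad save are the DataLad commands for inspecting and saving changes.

```shell
# Throwaway repository with one committed file.
demo=$(mktemp -d)
cd "$demo"
git init -q
echo "original" > notes.txt
git add notes.txt
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "add notes"

# Create one untracked file and modify the tracked file.
echo "scratch" > scratch.tmp
echo "edited" > notes.txt

git status --short     # lists notes.txt as modified and scratch.tmp as untracked
git clean -n           # dry run: shows what would be removed
git clean -f           # removes untracked files
git reset --hard -q    # restores tracked files to the last commit
git status --short     # prints nothing: the working tree is clean
```

Both git clean -f and git reset --hard discard work permanently, so the dry run is worth doing first. Note that git clean -f leaves untracked directories alone unless -d is added.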
git
git clean is for removing new, untracked files. For resetting existing, modified files to the last saved version, you would need git reset --hard.
When the dataset is clean, we are ready to datalad run!
Let’s go over each of the arguments:
- -m 'run hello': Human-readable message to record in the dataset log.
- -o 'outputs/hello.txt': Expected output of the script. You can specify multiple -o arguments and/or use wildcards like 'outputs/*'. This script has no inputs, but you can similarly specify inputs with -i.
- 'julia ... ': The final argument is the command that DataLad will run.
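Putting these arguments together, the full command might look like the sketch below. The script path code/hello.jl comes from earlier in this tutorial; using it as the elided part of the julia command is an assumption. The guard only skips the command where DataLad, Julia, or the dataset is unavailable.

```shell
# Run from the dataset root with a clean working tree.
if command -v datalad >/dev/null 2>&1 && command -v julia >/dev/null 2>&1 \
        && [ -d .datalad ] && [ -f code/hello.jl ]; then
    datalad run \
        -m 'run hello' \
        -o 'outputs/hello.txt' \
        'julia code/hello.jl'
else
    echo "skipped: needs datalad, julia, and code/hello.jl inside a dataset"
fi
```

If the command succeeds, datalad run saves the resulting changes to the dataset together with the message and a record of the command.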
Before getting to the exciting part, let’s do a quick sanity check.
View history and rerun
So what’s so good about the extra hassle of running scripts with datalad run?
To see that, you will need to pretend you are someone else (or future you!) and install the dataset somewhere else. Note that the -s argument would probably be a URL if you really were someone else.
Because a DataLad dataset is a Git repository, people who download your dataset can see exactly how outputs/hello.txt was created using Git’s logs.
Then, using that information, they can re-run the command that created the file using datalad rerun!
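A minimal sketch of how one might look up a commit ID and resolve an abbreviation. It uses a throwaway Git repository with a hypothetical file and commit message rather than the tutorial's dataset. The final datalad rerun line is left as a comment because it needs a commit that was recorded by datalad run.

```shell
# Throwaway repository with a single commit that adds outputs/hello.txt.
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir outputs
echo "hello" > outputs/hello.txt
git add outputs/hello.txt
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "run hello"

# Which commits touched the file? Each line starts with an abbreviated commit ID.
git log --oneline -- outputs/hello.txt

# Abbreviate the latest commit ID, then resolve the abbreviation back to the full ID.
short_id=$(git rev-parse --short HEAD)
full_id=$(git rev-parse "$short_id")
echo "$short_id -> $full_id"

# In a DataLad dataset, the ID of a commit made by datalad run can be passed to rerun:
# datalad rerun "$short_id"
```

In a real dataset, git log without --oneline also shows the full commit message, which is where the recorded command can be read.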
git
In Git, each commit (save state) is assigned a long, unique machine-generated ID. 52cf refers to the commit with an ID that starts with those characters. Git accepts abbreviations of 4 or more characters as long as they match exactly one commit. Of course, this ID is probably different for you, so change this argument to match your commit.
See Also
- To learn more basics and advanced applications of DataLad, check out the DataLad Handbook.
- DataLad is built on top of the popular version control tool Git. There are many great resources on Git online, like this free book.
- DataLad is mainly used from the terminal. For a detailed introduction to the Bash terminal, check the BashGuide.
- For even more reproducibility, you can include containers with your dataset to run analyses in. DataLad has an extension to support script execution in containers. See here.