Git & GitHub Tutorial for Scientists:
It’s Not Only for Programmers
Preface
What is this Git/GitHub Tutorial about?
The goal of this tutorial is to help scientists with no formal programming background (1) start using git locally for version control of their code, and (2) begin using GitHub to share their code and collaborate with others.
| Git for version control | GitHub for sharing your code |
|---|---|
Note: If we wait until our code is perfect before sharing it, we will, more likely than not, end up never sharing it.
Why is git important for scientists?
Git facilitates (1) documentation, and (2) sharing/collaborating. Both of these are important in science.
I. Version-control for code = the lab notebook of experiments
In scientific experiments, we are trained to record every parameter that we modified and tested, as this ensures consistency within an experiment and facilitates reproducible results. While we often record the experimental part of our work meticulously, the preprocessing and analysis of data are often ‘assumed’ to be recorded in our scripts.
Here are a few scenarios in which we wish a detailed history of how the code was developed were available:
Scenario A: Things were working, now they are not!
You have 5 people collaborating on a single script. That script now results in an error after someone changed a few lines over the weekend. They have no recollection of what was changed. ¯\_(._.)_/¯
Scenario B: Why were changes made?
You joined a new lab and were given a pipeline/script to modify that dates back years. The script is named cool_script_version25000.sh and has few comments. Whoever made it has quit academia, is traveling the world soul searching, and won’t answer emails. ¯\_(._.)_/¯
Scenario C: We made changes a long time ago…
Collaborator BigName wants to know what preprocessing parameters were used in your manuscript from 5 years ago. You have since changed multiple parameters in your default processing pipeline. ¯\_(._.)_/¯
Saving something as Final.doc is bad…but Final_script.R is just as bad (Image Credit: PhD Comic)
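To make Scenario A concrete, here is a minimal sketch of how a git history can answer "what changed over the weekend?". It builds a throwaway repository in a temporary directory, so it does not touch any real project. The file name analysis.sh, the threshold values, the commit messages, and the demo identity are all hypothetical and chosen only for illustration. It assumes git is installed.

```shell
# Create a throwaway repository in a temporary directory (hypothetical demo)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Identity used only inside this demo repository
git config user.name "Demo Scientist"
git config user.email "demo@example.com"

# First version of a hypothetical analysis script
echo "threshold=0.05" > analysis.sh
git add analysis.sh
git commit -q -m "Set significance threshold to 0.05"

# Someone changes a parameter over the weekend
echo "threshold=0.01" > analysis.sh
git commit -q -am "Lower threshold to 0.01 for pilot data"

# List the recorded changes, one line per commit
git log --oneline

# Show exactly which lines differ between the last two commits
git diff HEAD~1 HEAD
```

The log lists both commits with their messages, and the diff shows the old line removed and the new line added. The commit identifiers will differ on every run, so no exact output is reproduced here.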
II. Git and Remote Hosts (e.g., GitHub) make sharing/collaborating easier
Facilitates new collaboration
If you find a way to speed up some toolbox, Git/GitHub helps you suggest this change to the author of the toolbox, even if you don’t know each other.
Encourages open source and open science
Open source is a term typically used in software development, where it means the source code is open to the public. People can freely look at how a program was written, make improvements to it, etc.
In science, the idea of open source is closely linked with open science, a movement to make data, samples, software, and all things related to a scientific finding as transparent and easy to access as possible. There are pros and cons to both open and closed systems, a topic that cannot be adequately covered here (for more info, check out the Center for Open Science).
However, given a common goal of wanting more people to understand, reproduce, and build on top of previous work, an ‘open’ approach to your code is certainly a good place to start. Git/GitHub will help you share your code publicly. It also enables you to easily contribute to other projects, and to incorporate ideas and contributions from others in a systematic way (even strangers!).
Acknowledgment
The creation of this tutorial would not be possible without Software Carpentry’s git novice course. Check them out!
Other helpful links
- GitHub Education provides additional features to students & researchers for free (e.g., Pro account with unlimited collaborators)!
- Git the Simple Guide provides all the necessary commands to use git.
How to read the book?
All code/commands and their output are in gray boxes. The output always immediately follows the code/commands, and each line of output starts with ##.
For example, suppose we want to run the command ls to see what files are in the current directory.
Below is the code
ls
## DESCRIPTION
## README.md
## Rmd
## _bookdown.yml
## _output.yml
## docs
## git_github_bookdown.Rproj
## git_github_workshop.Rmd
## google_analytics.html
## google_tag_body.html
## google_tag_header.html
## img
## index.Rmd
Above is the output