Setting Up Your R Environment for Reproducible Science
I’ve recently had the pleasure of reviewing several excellent manuscripts where the authors have gone the extra mile, making their data and analysis code publicly available on platforms like the Open Science Framework. This commitment to open science is fantastic and a crucial step towards greater transparency in research. However, a common pitfall often prevents this good practice from being truly effective: the lack of a reproducible R environment.
Many researchers, myself included at times, fall into the habit of using setwd() or relying on a specific local directory structure. While this works perfectly well for our own machines, it creates a significant roadblock for anyone trying to reproduce our work. When a reviewer, collaborator, or future researcher tries to run the code, it fails immediately because their file paths are different.
Worse still, as I’ve seen in one particular case, leaving directory paths with your name or institutional information can inadvertently de-anonymise a submission during a blind peer-review process. This can compromise the integrity of the review and is an entirely avoidable issue.
If you’re still not convinced, consider this: reproducible code makes your own life easier. It significantly reduces the headaches if you ever need to switch between Windows and macOS. Windows uses backslashes (\) in file paths, while macOS and Linux use forward slashes (/), so a path copied from Windows Explorer will break on another system. With the here package or relative paths, your code uses forward slashes, which R accepts on every platform, so the dreaded backslash issues disappear. Making your code portable and functional “out of the box” doesn’t just save your collaborators from troubleshooting; it ensures your future self won’t have to either.
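To make the slash issue concrete, here is a small base R sketch. The Windows-style path is a made-up example, and the variable names are only illustrative.

```r
# A path as it might be copied from Windows Explorer (hypothetical example).
# Backslashes must be doubled inside an R string, because "\" starts an escape.
windows_path <- "C:\\Users\\YourName\\Documents\\MyProject\\data\\raw_data.csv"

# Convert backslashes to forward slashes, which R accepts on every platform.
portable_path <- gsub("\\", "/", windows_path, fixed = TRUE)
print(portable_path)

# file.path() joins its arguments with "/" on all platforms, so you never
# have to type a separator yourself.
relative_path <- file.path("data", "raw_data.csv")
print(relative_path)
```

Note that the converted path is still absolute, so it remains tied to one machine; only the relative version is portable.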
With that laundry list of reasons in mind, I thought I’d share how I now work. If a demonstration suits you better, you can also watch this useful video (https://youtu.be/StqDYjM6ULo?si=lCGSFP7NREf7lZdN). The rest of this post explains how to set up your R project with relative pathing, which makes your code portable and robust. The core idea is to rely on relative paths, not absolute ones. This means that your code will find files based on their location relative to the project’s root folder, rather than a fixed location on your hard drive.
Here’s a simple, step-by-step guide to get you started:
Step 1: Use RStudio Projects
The easiest way to manage this is by using RStudio Projects. When you create a new project, RStudio adds an .Rproj file to the project folder. This file marks the directory it resides in as the “root” of your project. All your scripts, data, and outputs should be organised within this project folder.
Step 2: Avoid setwd()
With an RStudio Project, you no longer need to use setwd(). RStudio automatically sets your working directory to the project’s root when you open the project (for example, by double-clicking the .Rproj file). This single change removes one of the most common reasons shared code fails on someone else’s machine.
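Before sharing a project, it can help to check that no setwd() calls are left behind. The sketch below is one possible way to do that in base R; find_setwd_calls() is a hypothetical helper written for this post, not a function from any package, and it will also flag setwd() calls that appear in comments.

```r
# Hypothetical helper: list lines in .R scripts that still mention setwd().
find_setwd_calls <- function(project_dir = ".") {
  # Find every .R file below project_dir
  scripts <- list.files(project_dir, pattern = "\\.[Rr]$",
                        recursive = TRUE, full.names = TRUE)

  # For each script, record the line numbers and text of any setwd( match
  hits <- lapply(scripts, function(script) {
    lines <- readLines(script, warn = FALSE)
    matched <- grep("setwd[[:space:]]*\\(", lines)
    if (length(matched) == 0) return(NULL)
    data.frame(file = script, line = matched, code = trimws(lines[matched]),
               stringsAsFactors = FALSE)
  })

  # Combine the per-file results; return an empty table if nothing was found
  result <- do.call(rbind, hits)
  if (is.null(result)) {
    result <- data.frame(file = character(), line = integer(),
                         code = character(), stringsAsFactors = FALSE)
  }
  result
}
```

Calling find_setwd_calls("scripts") from the project root would return one row per offending line, which is also a quick way to spot paths containing your name before a blind review.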
Step 3: Structuring Your Project
A common and effective project structure looks something like this:
MyProject/
├── data/
│   ├── raw_data.csv
│   └── cleaned_data.csv
├── scripts/
│   ├── 01_data_cleaning.R
│   └── 02_analysis.R
├── outputs/
│   └── my_plot.png
├── README.md
└── MyProject.Rproj
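If you prefer to create the folders from the console, the layout above can be sketched with base R as follows. This example builds the skeleton in a temporary directory so that nothing on your real file system is touched; project_root is an illustrative name, and the .Rproj file itself is not created here because RStudio generates it when you create the project.

```r
# Build the example layout inside a temporary directory.
# Replace project_root with your own project folder for real use.
project_root <- file.path(tempdir(), "MyProject")

# Create the three sub-folders (and the project folder itself, if needed)
for (folder in c("data", "scripts", "outputs")) {
  dir.create(file.path(project_root, folder), recursive = TRUE,
             showWarnings = FALSE)
}

# A README is a good place to describe what each folder contains
writeLines("# MyProject", file.path(project_root, "README.md"))

# Show what was created at the top level of the project
print(list.files(project_root))
```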
Step 4: Choose Your Pathing Method
You now have two primary methods for referencing files reproducibly. Both are better than setwd(), but each has its own pros and cons.
Method 1: The here Package (Recommended)
To make things truly bulletproof, I highly recommend using the here package. The here() function intelligently builds file paths starting from the root of your project, regardless of where your current R script is located.
Instead of writing:
# This is a bad idea
data <- read.csv("C:/Users/YourName/Documents/MyProject/data/raw_data.csv")
Or even:
# This might break if the working directory is not the project root
data <- read.csv("data/raw_data.csv")
You can write:
# The reproducible way
library(here)
data <- read.csv(here("data", "raw_data.csv"))
The here() function combines the folder names into a correct file path for your operating system, ensuring that the code will work seamlessly for anyone, anywhere, who has the project folder downloaded. My rationale for recommending this approach is that it is the most robust and least prone to error, especially as projects grow in complexity.
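The same pattern works for outputs as well as inputs. The sketch below assumes the here package is installed and uses the file names from the example structure in Step 3; here finds the root by looking for markers such as the .Rproj file, and calling here() with no arguments shows which directory it has picked.

```r
library(here)

# Show the directory that here has identified as the project root
print(here())

# Build paths for an input and an output file from the project root
cleaned_path <- here("data", "cleaned_data.csv")
plot_path <- here("outputs", "my_plot.png")

# Only read the file if it exists, so this sketch also runs in an empty folder
if (file.exists(cleaned_path)) {
  cleaned <- read.csv(cleaned_path)
  message("Read ", nrow(cleaned), " rows from ", cleaned_path)
} else {
  message("No file found at ", cleaned_path)
}

message("Plots would be saved to ", plot_path)
```

Because every path starts from the root, these lines behave the same whether they sit in scripts/01_data_cleaning.R or in a deeper subfolder.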
Method 2: Relative Paths with ../
An alternative, which doesn’t require an additional package, is to use standard relative pathing. The ../ syntax means “move up one directory level”. R resolves relative paths against the current working directory, not against the location of the script file. So if your working directory is the scripts folder (as it is by default when you knit an R Markdown document stored there) and you want to access a file in the data folder, you can go up a level to the project root and then down into the data folder.
For example, code running with scripts/ as its working directory could access data like this:
# The relative pathing way
data <- read.csv("../data/raw_data.csv")
This method is simple and effective for many projects. However, a key caveat is that the path breaks whenever the starting point changes. This happens if you move the file to a different location within your project (e.g., to a new subdirectory scripts/analysis/), and also if the code runs with the working directory at the project root, as in an open RStudio Project, where the correct path would be "data/raw_data.csv". This is why the here package is often a more reliable choice, as it always builds paths from the project root.
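To see why the starting point matters, the following base R sketch builds a throwaway project in a temporary folder and checks the same relative path from two different working directories. The helper resolves_from() is hypothetical and written only for this demonstration; it calls setwd() purely to simulate different starting points and always restores the original working directory.

```r
# Build a throwaway project in a fresh temporary folder
base_dir <- tempfile("demo_")
project_root <- file.path(base_dir, "MyProject")
dir.create(file.path(project_root, "data"), recursive = TRUE)
dir.create(file.path(project_root, "scripts"))
write.csv(data.frame(x = 1:3),
          file.path(project_root, "data", "raw_data.csv"),
          row.names = FALSE)

# Hypothetical helper: does relative_path point at an existing file when
# R's working directory is working_dir?
resolves_from <- function(working_dir, relative_path) {
  old_wd <- setwd(working_dir)
  on.exit(setwd(old_wd))  # restore the original working directory
  file.exists(relative_path)
}

# Working directory is scripts/: going up one level finds the data folder
from_scripts <- resolves_from(file.path(project_root, "scripts"),
                              "../data/raw_data.csv")

# Working directory is the project root: "../" now points above the project
from_root_dotdot <- resolves_from(project_root, "../data/raw_data.csv")

# From the project root, the path without "../" is the one that works
from_root_plain <- resolves_from(project_root, "data/raw_data.csv")

print(c(from_scripts = from_scripts,
        from_root_dotdot = from_root_dotdot,
        from_root_plain = from_root_plain))
```

The first and third checks succeed and the second fails, which is the fragility that building paths from the project root avoids.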
By adopting one of these practices, you ensure that your code is not just available, but truly reproducible. You remove a significant barrier for those trying to understand and build upon your work, and you protect yourself from accidental de-anonymisation during the review process. It’s a small change that makes a big difference for the entire research community.