15  Best Practices and Reproducibility

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Organise a research project using a consistent directory structure
  • Use renv to create a reproducible package environment
  • Implement basic Git version control from within RStudio
  • Write clean, readable R code following community conventions
  • Apply the principles of a reproducible research workflow

15.1 Project-Oriented Workflows

The foundation of reproducible research is a self-contained project. Every file needed to reproduce an analysis should live in one directory, with no hard-coded paths to files elsewhere on your computer.

15.1.1 Recommended Directory Structure

my-research-project/
├── _quarto.yml          # (if a Quarto book/website)
├── my-project.Rproj     # RStudio project file
├── renv.lock            # Package snapshot (managed by renv)
├── README.md            # Project overview
│
├── data/
│   ├── raw/             # Original, immutable data — never modify
│   └── processed/       # Cleaned, derived data
│
├── R/                   # Reusable R functions
│   ├── utils.R
│   └── plotting.R
│
├── analysis/            # Analysis scripts or .qmd files
│   ├── 01-clean.qmd
│   ├── 02-eda.qmd
│   └── 03-modelling.qmd
│
├── output/
│   ├── figures/
│   ├── tables/
│   └── reports/
│
└── docs/                # Rendered output (for GitHub Pages)

Important: The Two Rules of Raw Data
  1. Never modify raw data. Put it in data/raw/ and treat it as read-only.
  2. Always document its provenance. Add a data/raw/README.md recording where the data came from, when it was downloaded, and what it contains.
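
The layout and the two raw-data rules above can be set up from R. The following is a minimal base-R sketch, not a prescribed tool: `scaffold_project()` and `lock_raw_data()` are hypothetical helper names, and the README headings are placeholders to adapt.

```r
# Hypothetical helper: build the recommended layout under `root`.
scaffold_project <- function(root) {
  dirs <- c(
    "data/raw", "data/processed", "R", "analysis",
    "output/figures", "output/tables", "output/reports", "docs"
  )
  for (d in file.path(root, dirs)) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }

  # Rule 2: leave a provenance stub to fill in when the data arrive.
  readme <- file.path(root, "data", "raw", "README.md")
  if (!file.exists(readme)) {
    writeLines(
      c("# Raw data provenance", "", "- Source:", "- Downloaded on:", "- Contents:"),
      readme
    )
  }
  invisible(root)
}

# Rule 1: mark raw data files read-only so accidental overwrites fail.
# (File permissions behave differently on Windows.)
lock_raw_data <- function(root) {
  raw_files <- list.files(file.path(root, "data", "raw"), full.names = TRUE)
  data_files <- raw_files[basename(raw_files) != "README.md"]
  if (length(data_files) > 0) {
    Sys.chmod(data_files, mode = "0444")
  }
  invisible(data_files)
}

# Demonstration in a temporary folder, so nothing real is touched
demo_root <- file.path(tempdir(), "my-research-project")
scaffold_project(demo_root)
list.files(demo_root, recursive = TRUE, include.dirs = TRUE)
```

Call `lock_raw_data()` once the raw files are in place; the provenance README is deliberately left writable.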

15.1.2 The here Package Revisited

library(here)

# here() builds paths from the project root (the folder with the .Rproj file)
raw_data_path    <- here("data", "raw", "crop_yields_2023.csv")
clean_data_path  <- here("data", "processed", "crop_yields_clean.rds")
figure_path      <- here("output", "figures", "yield_trend.png")

# The same code works on any machine or OS; only the root folder printed differs
cat(raw_data_path, "\n")
#> /home/pawan/statistics-with-R/data/raw/crop_yields_2023.csv
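
To make the idea of a project root concrete, here is a simplified base-R sketch of the lookup. It is an illustration only: `find_project_root()` is a hypothetical function that searches for an .Rproj file, whereas the here package recognises several other markers as well.

```r
# Hypothetical helper: climb from `start` until a directory containing an
# .Rproj file is found.
find_project_root <- function(start = getwd()) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (length(list.files(dir, pattern = "\\.Rproj$")) > 0) {
      return(dir)
    }
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      stop("No .Rproj file found above ", start)
    }
    dir <- parent
  }
}

# Demonstration in a throwaway project inside the session's temp folder
root <- file.path(tempdir(), "demo-project")
dir.create(file.path(root, "analysis"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "demo-project.Rproj"))

# Starting from a subfolder still resolves to the project root
found <- find_project_root(file.path(root, "analysis"))
file.path(found, "data", "raw", "crop_yields_2023.csv")
```

Because the root is discovered rather than typed in, a script gives the same relative layout whether it is run from the project folder or from `analysis/`.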

15.1.3 Naming Conventions

Consistent file and variable naming pays dividends in readability:

# Files: use lowercase with hyphens or underscores, be descriptive
# Good:
01-data-cleaning.qmd
02-eda-wheat-prices.qmd
fig-yield-by-state.png

# Bad:
analysis.R
Final_FINAL_v3.qmd
myfig.png

# Objects: use snake_case consistently
# Good:
wheat_yield_2023 <- read_csv(...)
mean_yield_state <- ...

# Bad:
WheatYield <- ...
meanyield.2023 <- ...
x <- ...
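
Part of these conventions can be checked mechanically. The sketch below uses two hypothetical helpers, `is_tidy_file_name()` and `is_snake_case()`, with regular expressions chosen for this illustration. They test only case and separators, so they cannot tell whether a name is descriptive: `analysis.R` and `x` would both pass.

```r
# Hypothetical checker: lowercase letters and digits, separated by single
# hyphens or underscores, followed by a file extension.
is_tidy_file_name <- function(x) {
  grepl("^[a-z0-9]+([-_][a-z0-9]+)*\\.[A-Za-z0-9]+$", x)
}

# Hypothetical checker for snake_case object names.
is_snake_case <- function(x) {
  grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$", x)
}

is_tidy_file_name(c("01-data-cleaning.qmd", "Final_FINAL_v3.qmd"))
#> [1]  TRUE FALSE

is_snake_case(c("wheat_yield_2023", "WheatYield", "meanyield.2023"))
#> [1]  TRUE FALSE FALSE
```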

15.2 Version Control with Git

Git records a snapshot of your files each time you commit, lets you return to any committed state, and makes collaboration manageable. GitHub hosts Git repositories online.

15.2.1 Basic Git Workflow

# Initialise a new repository
git init

# Check status
git status

# Stage files for commit
git add .                         # Stage all changed files
git add analysis/01-clean.qmd     # Stage one file

# Commit with a message
git commit -m "Add data cleaning script for wheat yields"

# View history
git log --oneline

# Push to GitHub (rename the branch first in case git init created "master")
git branch -M main
git remote add origin https://github.com/username/my-project.git
git push -u origin main

15.2.2 Branching

# Create a new branch for an experiment
git checkout -b feature/add-poisson-model

# Make changes, then stage and commit them
git add .
git commit -m "Add Poisson regression for hospital admissions"

# Merge back to main
git checkout main
git merge feature/add-poisson-model

15.2.3 Git in RStudio

RStudio provides a GUI for Git:

  1. Tools → Version Control → Project Setup — enables Git for a project
  2. The Git pane (Environment panel, Git tab) shows changed files
  3. Click Diff to see exactly what changed
  4. Check boxes to Stage, then click Commit
  5. Use Pull before starting work; Push when done

Tip: What to Put in .gitignore

Some files should not be tracked. Create a .gitignore file:

# R artefacts
.Rhistory
.RData
.Rproj.user/

# Large data files (share via cloud or DOI instead)
data/raw/*.csv
data/raw/*.xlsx

# Rendered outputs (can be regenerated)
# Drop the docs/ entry if GitHub Pages serves your site from docs/
_book/
docs/
*.pdf

# renv package library (renv.lock is enough)
renv/library/
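
If you prefer to maintain .gitignore from R, a small helper can append entries without duplicating them. This is a sketch under assumptions: `add_to_gitignore()` is a hypothetical name, and the demonstration writes to a temporary file rather than a real repository. The usethis package provides `use_git_ignore()` for the same purpose.

```r
# Hypothetical helper: append entries to a .gitignore file, skipping any
# line that is already present.
add_to_gitignore <- function(entries, path = ".gitignore") {
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character(0)
  new_entries <- setdiff(entries, existing)
  if (length(new_entries) > 0) {
    writeLines(c(existing, new_entries), path)
  }
  invisible(new_entries)
}

# Demonstration on a temporary file
ignore_file <- file.path(tempdir(), "demo.gitignore")
unlink(ignore_file)
add_to_gitignore(c(".Rhistory", ".RData", "renv/library/"), ignore_file)
add_to_gitignore(c(".RData", "_book/"), ignore_file)  # .RData is not repeated
readLines(ignore_file)
```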

15.3 Package Management with renv

renv creates a project-specific package library and records exact package versions, so anyone (including future-you) can reproduce your exact environment.

library(renv)

# 1. Initialise renv in your project
renv::init()
# Creates: renv/ directory, renv.lock, .Rprofile

# 2. Install packages as usual
install.packages("gapminder")

# 3. Snapshot the current state
renv::snapshot()
# Records the versions of the packages your project uses in renv.lock

# 4. On a new machine or after cloning from GitHub
renv::restore()
# Installs exactly the packages recorded in renv.lock

The renv.lock file looks like this:

{
  "R": {
    "Version": "4.3.2",
    "Repositories": [{"Name": "CRAN", "URL": "https://cloud.r-project.org"}]
  },
  "Packages": {
    "ggplot2": {
      "Package": "ggplot2",
      "Version": "3.4.4",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "..."
    }
  }
}

Tip: Commit renv.lock, not renv/library/

Add renv.lock to Git to share the package snapshot. Add renv/library/ to .gitignore — it is large and platform-specific.
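
To see the kind of information a lockfile pins, you can query installed versions directly. The sketch below is a conceptual illustration and not how renv works internally: `record_versions()` is a hypothetical helper, and renv additionally records hashes and sources and discovers dependencies for you.

```r
# Hypothetical helper: look up the installed version of each package.
record_versions <- function(pkgs) {
  data.frame(
    package = pkgs,
    version = vapply(
      pkgs,
      function(p) as.character(utils::packageVersion(p)),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Base packages are used here only so the example runs on any installation
versions <- record_versions(c("stats", "utils"))
versions

# The R version itself is part of the environment too
R.version.string
```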

15.4 Writing Clean R Code

15.4.1 The tidyverse Style Guide

# Spacing: spaces around operators and after commas
x <- 1 + 2          # Good
x<-1+2              # Bad

# Line length: max ~80 characters; break long pipes
result <- my_data |>
  filter(year == 2023) |>
  group_by(state) |>
  summarise(mean_yield = mean(yield, na.rm = TRUE))

# Not:
result <- my_data |> filter(year == 2023) |> group_by(state) |> summarise(mean_yield = mean(yield, na.rm = TRUE))

# Comments: explain WHY, not WHAT
# Good: gives the reason for the calculation
# Index against the national average so states are directly comparable
yield_index <- state_yield / national_avg_yield

# Bad: restates the code and uses a vague name
x <- state_yield / national_avg_yield   # divide state by national avg

# Function naming: use verbs
compute_yield_index()   # Good
yield_index_function()  # Bad
yi()                    # Bad
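
As an illustration of these conventions working together, here is one way the verb-named function could look. The body is an assumption made for this example (the text above gives only the name), and the yields are made-up numbers.

```r
# Express each state's yield relative to the national average
# (a value of 1 means the state is exactly at the national average).
compute_yield_index <- function(state_yield, national_avg_yield) {
  stopifnot(
    is.numeric(state_yield),
    is.numeric(national_avg_yield),
    length(national_avg_yield) == 1,
    national_avg_yield > 0
  )
  state_yield / national_avg_yield
}

# Made-up yields, for illustration only
compute_yield_index(c(3.0, 4.5), national_avg_yield = 3.0)
#> [1] 1.0 1.5
```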

15.4.2 Using lintr and styler

# lintr: check code style automatically
install.packages("lintr")
lintr::lint("R/utils.R")              # Check one file
lintr::lint_dir("R/")                 # Check all files in a directory

# styler: automatically reformat code to tidyverse style
# (styler rewrites files in place, so commit your work first)
install.packages("styler")
styler::style_file("R/utils.R")
styler::style_dir("R/")
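
To get a feel for what a linter does, the following base-R sketch reproduces one simple check, the line-length limit. It is a crude stand-in rather than a replacement for lintr: `find_long_lines()` is a hypothetical helper, and the 80-character default follows the guideline mentioned earlier.

```r
# Hypothetical helper: report lines longer than `limit` characters.
find_long_lines <- function(path, limit = 80) {
  lines <- readLines(path, warn = FALSE)
  too_long <- which(nchar(lines) > limit)
  data.frame(
    line = too_long,
    width = nchar(lines[too_long]),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a throwaway script with one short and one long line
script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- 1 + 2",
  paste0("y <- c(", paste(rep("1000000", 12), collapse = ", "), ")")
), script)

find_long_lines(script)
```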

15.5 A Complete Reproducible Workflow

Putting it all together, a fully reproducible analysis follows this sequence:

# 1. PROJECT SETUP (once)
# - Create RStudio Project
# - Initialise Git: git init
# - Initialise renv: renv::init()
# - Create directory structure

# 2. DATA ACQUISITION (data/raw/ — never modified)
# - Download/receive raw data
# - Document provenance in data/raw/README.md
# - Commit: git commit -m "Add raw data files"  (files matched by .gitignore are skipped)

# 3. DATA CLEANING (analysis/01-clean.qmd)
library(tidyverse)
library(here)
library(janitor)

raw <- read_csv(here("data", "raw", "crop_yields_raw.csv"),
                na = c("", "NA", "N/A", "--"))
clean <- raw |>
  clean_names() |>
  filter(!is.na(state)) |>
  mutate(year = as.integer(year))

saveRDS(clean, here("data", "processed", "crop_yields_clean.rds"))

# 4. ANALYSIS (analysis/02-eda.qmd, 03-modelling.qmd, ...)
# - Load from data/processed/
# - All outputs go to output/

# 5. SNAPSHOT AND COMMIT
# renv::snapshot()
# git add .
# git commit -m "Add EDA chapter with yield distributions"

# 6. PUBLISH
# quarto render
# quarto publish gh-pages

15.6 The Turing Test for Reproducibility

A useful heuristic: could a stranger reproduce your analysis from scratch? More concretely:

  1. Clone your GitHub repository onto a fresh computer
  2. Run renv::restore() to install packages
  3. Run quarto render or execute scripts in order
  4. Do you get identical results?

If yes, your work is reproducible. If no, something is missing.
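
One way to test for identical results is to run the pipeline twice and compare checksums of the outputs. The sketch below is illustrative: `fingerprint_outputs()` and `run_analysis()` are hypothetical names, and `run_analysis()` is a toy stand-in for a real pipeline that writes one small table of simulated values.

```r
# Hypothetical helper: MD5 checksum for every file under `dir`,
# named by path relative to `dir` so two runs can be compared.
fingerprint_outputs <- function(dir) {
  rel_paths <- sort(list.files(dir, recursive = TRUE))
  sums <- tools::md5sum(file.path(dir, rel_paths))
  names(sums) <- rel_paths
  sums
}

# Toy stand-in for a real pipeline: simulate yields and save a table.
run_analysis <- function(out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  set.seed(2023)
  yields <- data.frame(
    state = c("A", "B", "C"),
    yield = round(rnorm(3, mean = 3, sd = 0.5), 3)
  )
  write.csv(yields, file.path(out_dir, "yield_summary.csv"), row.names = FALSE)
  invisible(out_dir)
}

run_1 <- run_analysis(file.path(tempdir(), "repro-run-1"))
run_2 <- run_analysis(file.path(tempdir(), "repro-run-2"))

identical(fingerprint_outputs(run_1), fingerprint_outputs(run_2))
#> [1] TRUE
```

Byte-level comparison suits tables and text files. Rendered files such as PDFs may embed timestamps, so for those it is usually safer to compare the underlying numbers.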

Important: The Three Enemies of Reproducibility
  1. Hard-coded paths: read_csv("C:/Users/pawan/Desktop/data.csv") — use here() instead
  2. Random numbers without set.seed(): results change every run
  3. Undocumented package versions: install.packages("ggplot2") installs the current version, which changes over time — use renv
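
The second enemy is easy to demonstrate. In this short sketch the seed value 42 is arbitrary:

```r
set.seed(42)
first_draw <- rnorm(3)

set.seed(42)
second_draw <- rnorm(3)

# Same seed, same numbers
identical(first_draw, second_draw)
#> [1] TRUE

# Without reseeding, the random stream simply continues
third_draw <- rnorm(3)
identical(first_draw, third_draw)
#> [1] FALSE
```

Setting the seed once near the top of each script or document is usually enough to make simulated or resampled results repeatable.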

15.7 Exercises

  1. Create a new RStudio Project with the recommended directory structure. Initialise renv and install three packages. Take a snapshot. List what is recorded in renv.lock.

  2. Initialise Git in your project. Make three commits — one for the project structure, one after adding a data file, and one after writing an analysis script. View the log with git log --oneline.

  3. Apply lintr::lint() to any existing R script you have written (or lintr::lint_dir() to a folder of scripts). Fix at least three style issues it identifies.

  4. Write a README.md for a hypothetical project. It should include: project title, objective, data sources, how to reproduce the analysis, and required R version.

  5. Challenge: Take any analysis from a previous chapter and make it fully reproducible, with a clean directory structure, renv.lock, .gitignore, and a parameterised Quarto report, then publish it to GitHub. Share the URL.