Appendix C: Common R Packages for Data Science

This appendix catalogues the R packages used throughout this book, organised by task, with brief descriptions and installation commands.

Installing All Book Packages

# Run this once to install every package used in this book
pkgs <- c(
  # Core tidyverse
  "tidyverse", "tibble", "dplyr", "tidyr", "readr",
  "ggplot2", "purrr", "stringr", "forcats", "lubridate",

  # Data import
  "readxl", "haven", "jsonlite", "googlesheets4",
  "writexl", "DBI", "RSQLite", "RPostgres",

  # Data cleaning
  "janitor", "skimr", "naniar",

  # Visualisation
  "plotly", "corrplot", "GGally", "ggfortify",
  "patchwork", "scales", "RColorBrewer", "viridis",
  "ggthemes", "ggrepel",

  # Statistics
  "moments", "effectsize", "car", "multcomp", "emmeans",

  # Regression and modelling
  "broom", "modelsummary", "gtsummary", "gt",
  "lme4", "broom.mixed", "MASS", "performance",

  # Multivariate
  "factoextra", "cluster", "FactoMineR",

  # Time series
  "forecast", "tseries", "tsibble", "feasts", "fable",

  # Machine learning
  "caret", "tidymodels",

  # Functional programming / utilities
  "furrr", "bench", "microbenchmark",
  "here", "renv", "fs", "glue",

  # Quarto / reporting
  "knitr", "rmarkdown", "quarto",

  # Debugging / style
  "lintr", "styler",

  # Datasets
  "gapminder", "palmerpenguins", "nycflights13"
)

# Install packages not already installed
new_pkgs <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(new_pkgs)) install.packages(new_pkgs)
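
After running the installer, it can be useful to confirm that every package actually loads. The following is a minimal sketch using only base R; the helper name missing_packages() is hypothetical and not part of any package. Unlike the installed.packages() check above, requireNamespace() tries to load each namespace, so it also catches packages that are installed but broken.

```r
# Hypothetical helper: return the packages whose namespaces cannot be loaded.
missing_packages <- function(pkgs) {
  pkgs <- unique(pkgs)  # ignore names listed more than once
  is_loadable <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_loadable]
}

# Base packages ship with R, so nothing should be reported as missing here.
missing_packages(c("stats", "utils"))
```

Calling missing_packages(pkgs) with the vector defined above should return an empty character vector once installation has succeeded.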

Package Catalogue

Data Import and Export

Table 1: Packages for data import and export.

Package	Purpose	Install
readr	Read CSV, TSV, and delimited text files	tidyverse
readxl	Read Excel files (.xls, .xlsx) without Java	tidyverse
haven	Read SPSS (.sav), Stata (.dta), SAS (.sas7bdat) files	install.packages('haven')
writexl	Write data frames to Excel without Java	install.packages('writexl')
jsonlite	Parse and generate JSON	install.packages('jsonlite')
DBI	Unified database interface	install.packages('DBI')
RSQLite	Interface to SQLite databases	install.packages('RSQLite')
RPostgres	Interface to PostgreSQL databases	install.packages('RPostgres')
googlesheets4	Read and write Google Sheets	install.packages('googlesheets4')
httr2	HTTP requests for web APIs	install.packages('httr2')

Data Wrangling

Table 2: Packages for data manipulation and cleaning.

Package	Purpose
dplyr	Core data manipulation: filter, select, mutate, summarise
tidyr	Reshape data: pivot_longer, pivot_wider, separate, unite
data.table	High-performance data manipulation for large datasets
janitor	Clean data frame names and values (clean_names)
skimr	Rich summary statistics with skim()
lubridate	Parse, manipulate, and do arithmetic on dates and times
stringr	Consistent string manipulation functions
forcats	Tools for working with categorical (factor) variables
naniar	Visualise and analyse missing data patterns
validate	Validate data against rules

Visualisation

Table 3: Packages for data visualisation.

Package	Purpose
ggplot2	Grammar of Graphics — the core visualisation package
plotly	Interactive charts; converts ggplot2 with ggplotly()
patchwork	Combine multiple ggplot2 plots with / and |
corrplot	Visualise correlation matrices
GGally	Pairs plots, parallel coordinates, and ggplot2 extensions
ggrepel	Non-overlapping text labels for scatter plots
ggthemes	Additional themes (Economist, Tufte, FiveThirtyEight, …)
viridis	Perceptually uniform colour palettes (colourblind-safe)
scales	Format axis labels (dollar, percent, comma, …)
ggfortify	autoplot() for time series, PCA, survival objects
leaflet	Interactive choropleth and point maps

Statistical Modelling

Table 4: Packages for statistical modelling and inference.

Package	Purpose
stats	Base R statistics: lm, glm, t.test, aov, etc.
car	Companion to Applied Regression: Anova(), vif(), leveneTest()
multcomp	Multiple comparisons and simultaneous inference
emmeans	Estimated marginal means for factorial designs
lme4	Linear and generalised linear mixed models
MASS	Negative binomial regression, stepwise selection, LDA
survival	Survival analysis: Kaplan-Meier, Cox regression
nlme	Mixed effects models with correlation structures
mgcv	Generalised additive models (GAMs)
glmnet	LASSO, Ridge, and Elastic Net regression
broom	Tidy model outputs: tidy, glance, augment
modelsummary	Publication-quality regression tables
gtsummary	Publication-ready summary tables: tbl_summary(), tbl_regression()
effectsize	Standardised effect sizes (Cohen’s d, eta-squared, …)
performance	Model quality indices: R², AIC, ICC, RMSE
see	Visualisation for easystats packages
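
As a small illustration of how the modelling packages in this table fit together, the sketch below fits a regression with base R's stats package and then, only if broom happens to be installed, prints the same coefficients as a tidy data frame. The model (mpg on wt in the built-in mtcars data) is an arbitrary choice for illustration, not an example taken from the book's chapters.

```r
# Fit a simple linear regression with base R's stats package.
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table as a plain numeric matrix (base R only).
coef_table <- summary(fit)$coefficients
print(coef_table)

# If broom is installed, the same information is available as a tidy data frame.
if (requireNamespace("broom", quietly = TRUE)) {
  print(broom::tidy(fit))
}
```

The guard around broom::tidy() keeps the example runnable on a fresh R installation; in a project where broom is a declared dependency the guard would normally be dropped.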

Time Series

Table 5: Packages for time series analysis and forecasting.

Package	Purpose
forecast	ARIMA, ETS, Holt-Winters forecasting; auto.arima()
tseries	Unit root tests (adf.test), ARCH effects
tsibble	Tidy time series data structure
feasts	Feature extraction and visualisation for tsibble
fable	Tidy forecasting framework (ARIMA, ETS, …)
prophet	Facebook’s additive forecasting model (trends + seasonality)
xts	Extensible time series for irregular data
zoo	Ordered observations for irregular time series

Reproducibility and Workflow

Table 6: Packages for reproducible research workflows.

Package	Purpose
renv	Reproducible package environments with lockfile
here	File paths relative to project root
targets	Make-style pipeline for complex, cached workflows
quarto	Render Quarto documents from R
knitr	R code chunks in documents; kable() for tables
rmarkdown	Dynamic documents with R Markdown
lintr	Static code analysis and style checking
styler	Automatic code reformatting to tidyverse style
usethis	Automate project and package setup tasks
devtools	Package development tools

Datasets

Table 7: Packages that provide datasets for learning and examples.

Package	Key Datasets
gapminder	gapminder: GDP per capita, life expectancy, population for 142 countries, 1952–2007
palmerpenguins	penguins: measurements for 344 penguins of 3 species
nycflights13	flights, airlines, airports, planes, weather (2013)
ISLR2	Auto, Boston, Caravan, Wage, and many others (ISLR textbook)
wooldridge	100+ datasets for Wooldridge’s Econometrics textbook
AER	CPS1985, Fatalities, PSID1976, and many applied econometrics datasets
datasets	mtcars, iris, airquality, USArrests, etc. (built into R)
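
To see what a dataset package provides, base R's data() function can list its contents. The sketch below does this for the built-in datasets package, which is always available; the same call with another package name (for example "gapminder") should work once that package is installed.

```r
# List the datasets bundled with the built-in 'datasets' package.
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Check that the datasets named in the table are present.
c("mtcars", "iris", "airquality", "USArrests") %in% available
```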

Package Discovery

Finding the right package for a new task:

CRAN Task Views: cran.r-project.org/web/views/ — curated lists by topic (Econometrics, TimeSeries, Spatial, …)
rOpenSci: ropensci.org — peer-reviewed packages for scientific data
Posit Community: community.rstudio.com
R-bloggers: r-bloggers.com
Stack Overflow: tag [r] for questions

Getting Package Help

# Built-in help
?dplyr::filter
help(package = "ggplot2")
vignette("dplyr")                  # Package tutorial
vignette(package = "tidyr")        # List all vignettes

# Check package version
packageVersion("ggplot2")

# List all exported objects in a package (ls() needs the package attached)
library(dplyr)
ls("package:dplyr")

# Package news (changelog)
news(package = "ggplot2")
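
Version checks are often needed in scripts rather than at the console. The following is a minimal sketch, assuming a hypothetical helper named has_min_version(); it relies only on requireNamespace() and utils::packageVersion() from base R.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# 'stats' ships with R, so its version matches the running R version.
has_min_version("stats", "3.0.0")
```

Because && short-circuits, packageVersion() is never called for a package that is not installed, so the helper returns FALSE instead of raising an error.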
