1  Introduction to the R Ecosystem

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Explain what R is and why it is used in data science and research
  • Compare R with Python, SAS, and SPSS
  • Navigate the RStudio IDE confidently
  • Install and load R packages
  • Create and manipulate R’s fundamental data structures

1.1 What Is R?

R is a programming language and environment designed specifically for statistical computing and data visualisation. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the early 1990s, and is today maintained by the R Core Team with contributions from thousands of developers worldwide.

R is free and open-source, distributed under the GNU General Public Licence. This means it costs nothing to use, and anyone can inspect, modify, or extend it.

Tip: Why Researchers Love R

R is popular in academia and research not only because it is free, but because the statistical community builds in R. New statistical methods are often released as R packages on CRAN (the Comprehensive R Archive Network) before they appear in commercial software.

1.1.1 R vs. Python, SAS, and SPSS

All four tools can perform statistical analysis. The differences matter, though:

Table 1.1: Comparison of Statistical Software
Feature             R        Python   SAS         SPSS
Cost                Free     Free     Expensive   Expensive
Statistical depth   ★★★★★    ★★★★☆    ★★★★★       ★★★☆☆
Visualisation       ★★★★★    ★★★★☆    ★★★☆☆       ★★☆☆☆
Machine Learning    ★★★☆☆    ★★★★★    ★★★☆☆       ★★☆☆☆
Reproducibility     ★★★★★    ★★★★☆    ★★★☆☆       ★★☆☆☆
Community (Stats)   ★★★★★    ★★★☆☆    ★★★☆☆       ★★☆☆☆

The bottom line: If your work is primarily statistical analysis and research, R is a strong default choice. If you need deep machine learning or production deployment pipelines, Python is often the better choice. SAS and SPSS are commercial tools found mainly in organisations with long-standing institutional commitments to them.

1.2 The R GUI vs. RStudio IDE

When you install R, you get a minimal graphical user interface (GUI) — essentially a console where you can type commands. This is functional but inconvenient for any serious work.

RStudio is an Integrated Development Environment (IDE) that wraps around R and makes it vastly more productive. The company that develops it rebranded from RStudio to Posit in 2022, but the IDE itself is still called RStudio. It is the tool used throughout this book.

1.2.1 The Four Panes of RStudio

RStudio organises itself into four panes (Figure 1.1):

Pane                  Position       Primary Purpose
Source Editor         Top-Left       Write and save scripts (.R, .qmd)
Console               Bottom-Left    Run code interactively
Environment/History   Top-Right      See objects in memory
Files/Plots/Help      Bottom-Right   Navigate files, view plots

Figure 1.1: Conceptual layout of the RStudio IDE with its four main panes.

Pane 1 — Source Editor (top-left): Where you write scripts. Files with the .R extension are plain R scripts; .qmd files are Quarto documents (covered in Chapter 2).

Pane 2 — Console (bottom-left): The live R session. Code runs here. You can also type directly into the console for quick experiments, but anything you want to keep should go in a script.

Pane 3 — Environment / History (top-right): Lists every object currently in R’s memory — data frames, vectors, models, and so on. The History tab shows every command you have run.

Pane 4 — Files / Plots / Packages / Help (bottom-right): A multi-purpose panel. Use Files to navigate your project directory, Plots to see graphics, Packages to manage installed packages, and Help to read documentation.

1.2.2 RStudio Projects

One of the most important habits to develop is using RStudio Projects. A project is simply a directory with a .Rproj file that tells RStudio where the “root” of your work is.

To create a new project: File → New Project → New Directory → New Project.

Important: Always Work Inside a Project

Using RStudio Projects ensures that setwd() is never needed. All file paths become relative to the project root, making your work portable and reproducible on any machine.
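
As a minimal sketch of what project-relative paths look like in practice, the snippet below builds a throwaway folder that stands in for a project root and reads a file from it using a relative path. The folder name demo_project and the file data/crops.csv are hypothetical, and the setwd() call appears only to simulate what RStudio does automatically when you open a .Rproj file.

```r
# A temporary folder stands in for the project root (hypothetical layout).
project_root <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a small example file to <project root>/data/crops.csv.
write.csv(
  data.frame(crop = c("Wheat", "Rice"), yield_qt = c(50.2, 32.1)),
  file.path(project_root, "data", "crops.csv"),
  row.names = FALSE
)

# Simulate opening the project: RStudio sets the working directory for you.
old_wd <- setwd(project_root)

# The path is relative to the project root, not to one particular machine.
crops <- read.csv(file.path("data", "crops.csv"))

setwd(old_wd)  # restore the previous working directory
nrow(crops)
#> [1] 2
```

Because the script never mentions an absolute location such as a user's home folder, the same line of code should work for anyone who opens the project.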

1.3 Installing and Managing Packages

Base R is powerful, but its real strength comes from the ecosystem of packages — collections of functions, data, and documentation that extend R’s capabilities.

1.3.1 Installing Packages

Code
# Install a single package from CRAN
install.packages("ggplot2")

# Install multiple packages at once
install.packages(c("dplyr", "tidyr", "readr"))

# Install a development version from GitHub (requires the remotes package)
# remotes::install_github("tidyverse/ggplot2")

install.packages() downloads a package to your computer — you do this once.

library() loads a package into the current R session — you do this every time you start a new session or write a script that uses that package.
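
One common way to combine the two steps in a script is to install only what is missing and then load as usual. The sketch below is one possible pattern rather than a required idiom; it uses packages that ship with R (stats and utils) so that it runs without a network connection, and the variable names are illustrative.

```r
# Packages this script depends on (base packages, so nothing is downloaded here).
needed <- c("stats", "utils")

# requireNamespace() returns TRUE if a package is installed, without attaching it.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Install only what is missing (nothing, for the packages above).
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Attach for the current session; this part is repeated in every new session.
library(stats)
library(utils)
```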

1.3.2 Loading Packages

Code
# Load a package for use in the current session
library(ggplot2)
library(dplyr)

# Check which packages are currently attached to the search path
# search()
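
A package does not have to be attached with library() before one of its functions can be called. The package::function() form, already seen in remotes::install_github(), reaches into an installed package directly. The short sketch below uses only packages that ship with R.

```r
# Call a function through its namespace without attaching the package.
first_three <- utils::head(letters, 3)
first_three
#> [1] "a" "b" "c"

# Attached packages appear on the search path with a "package:" prefix.
"package:stats" %in% search()

# A namespace can be loaded without being attached; this checks the loaded set.
"utils" %in% loadedNamespaces()
#> [1] TRUE
```

The :: form is handy when you need a single function once, or when two attached packages export functions with the same name.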

1.3.3 The Tidyverse Meta-Package

The tidyverse is a curated collection of packages that share a common design philosophy. Loading it attaches the core packages in one step: nine as of tidyverse 2.0.0, which added lubridate to the original eight.

Code
install.packages("tidyverse")
Code
library(tidyverse)
# Attaches: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
# (and lubridate from tidyverse 2.0.0 onwards)

1.3.4 Package Management with renv

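For serious projects, use renv to create a project-specific package library. renv records the exact version of every package in a lockfile (renv.lock), so that collaborators, or you years later, can reinstall the same package versions. This greatly improves reproducibility, although factors outside the package library, such as the operating system, can still affect results.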

Code
install.packages("renv")
renv::init()        # Initialise renv for the current project
renv::snapshot()    # Record current package versions
renv::restore()     # Reinstall packages from the snapshot
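
renv needs a project to act on, but the kind of information a snapshot records can be inspected with base R alone. The sketch below only illustrates the idea of pinning versions; it does not describe how renv works internally.

```r
# Version of R itself.
r_version <- getRversion()

# Version of one installed package (stats ships with R).
stats_version <- utils::packageVersion("stats")

# Version objects compare numerically, component by component.
r_version >= "3.0.0"
#> [1] TRUE

# A plain-text form, suitable for writing into a log or report.
format(stats_version)
```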

1.4 Introduction to R Objects

Everything in R is an object. Understanding the different types of objects is fundamental to using R effectively.

1.4.1 Vectors

The most basic object in R is a vector — an ordered collection of values of the same type.

Code
# Numeric vector
heights <- c(165, 172, 158, 180, 169)
print(heights)
#> [1] 165 172 158 180 169

# Character vector
cities <- c("Delhi", "Mumbai", "Kolkata", "Chennai")
print(cities)
#> [1] "Delhi"   "Mumbai"  "Kolkata" "Chennai"

# Logical vector
above_170 <- heights > 170
print(above_170)
#> [1] FALSE  TRUE FALSE  TRUE FALSE

# Integer vector (note the L suffix)
counts <- c(1L, 5L, 3L, 7L, 2L)
class(counts)
#> [1] "integer"
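
Two consequences of the "same type" rule are worth seeing directly: mixing types inside c() silently converts everything to one type, and arithmetic applies to every element at once. The following sketch reuses the heights values from above; the other names are illustrative.

```r
# Mixing types in c() coerces everything to the most flexible type.
mixed <- c(165, "tall", TRUE)
class(mixed)
#> [1] "character"

# Arithmetic is vectorised: the operation applies to every element.
heights <- c(165, 172, 158, 180, 169)
heights_m <- heights / 100
heights_m
#> [1] 1.65 1.72 1.58 1.80 1.69

# Summary functions reduce a vector to a single value.
mean(heights)
#> [1] 168.8
length(heights)
#> [1] 5
```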

Subsetting vectors uses square brackets []:

Code
# First element
heights[1]
#> [1] 165

# Elements 2 through 4
heights[2:4]
#> [1] 172 158 180

# Elements matching a condition
heights[heights > 170]
#> [1] 172 180

# Using a logical vector to subset
heights[above_170]
#> [1] 172 180
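
Square brackets accept a few more kinds of index than the ones shown above. The sketch below covers negative indices, which(), names, and assignment into a subset; the element names are hypothetical and exist only for illustration.

```r
heights <- c(165, 172, 158, 180, 169)

# Negative indices drop elements instead of selecting them.
heights[-1]
#> [1] 172 158 180 169

# which() converts a logical condition into positions.
which(heights > 170)
#> [1] 2 4

# Elements can be named and then selected by name (hypothetical names).
names(heights) <- c("asha", "ben", "chen", "dev", "esha")
heights[c("ben", "dev")]
#> ben dev 
#> 172 180 

# Assigning into a subset changes only the matching elements.
heights[heights < 160] <- NA
sum(is.na(heights))
#> [1] 1
```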

1.4.2 Factors

Factors represent categorical variables. They look like characters but store the data as integers internally, which is more memory-efficient and important for statistical modelling.

Code
income_group <- factor(
  c("Low", "Medium", "High", "Medium", "Low", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

print(income_group)
#> [1] Low    Medium High   Medium Low    High  
#> Levels: Low < Medium < High
levels(income_group)
#> [1] "Low"    "Medium" "High"
table(income_group)
#> income_group
#>    Low Medium   High 
#>      2      2      2
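
To see the integer storage described above, and what the ordered = TRUE argument makes possible, the factor can be inspected as follows. This is a small illustration that recreates the same income_group object.

```r
income_group <- factor(
  c("Low", "Medium", "High", "Medium", "Low", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# The underlying integer codes follow the order of the levels.
as.integer(income_group)
#> [1] 1 2 3 2 1 3

# Ordered factors support comparisons against a level.
income_group >= "Medium"
#> [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE

# Convert back to plain text when the labels are needed as strings.
as.character(income_group)[1:2]
#> [1] "Low"    "Medium"
```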

1.4.3 Matrices

A matrix is a two-dimensional vector — all elements must be of the same type.

Code
# Create a 3x3 matrix
m <- matrix(1:9, nrow = 3, ncol = 3)
print(m)
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9

# Indexing and matrix operations
m[2, 3]          # Row 2, column 3
#> [1] 8
m[1, ]           # Entire first row
#> [1] 1 4 7
m[, 2]           # Entire second column
#> [1] 4 5 6
t(m)             # Transpose
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
#> [3,]    7    8    9
m %*% m          # Matrix multiplication
#>      [,1] [,2] [,3]
#> [1,]   30   66  102
#> [2,]   36   81  126
#> [3,]   42   96  150
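
The description of a matrix as a two-dimensional vector can be taken quite literally, as the first lines of this sketch show. It also contrasts element-wise multiplication with the matrix product, a frequent source of confusion.

```r
# A matrix is a vector with a dim attribute.
v <- 1:6
dim(v) <- c(2, 3)
is.matrix(v)
#> [1] TRUE

m <- matrix(1:9, nrow = 3, ncol = 3)

# * multiplies element by element; %*% is the matrix product.
(m * m)[2, 3]
#> [1] 64
(m %*% m)[2, 3]
#> [1] 126

# byrow = TRUE fills across rows instead of down columns.
matrix(1:6, nrow = 2, byrow = TRUE)
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6

# Row and column summaries.
rowSums(m)
#> [1] 12 15 18
colSums(m)
#> [1]  6 15 24
```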

1.4.4 Lists

A list is a collection of objects that can be of different types — including other lists. Lists are the most flexible data structure in R.

Code
survey_result <- list(
  respondent_id = 101,
  name = "Aisha",
  age = 28,
  responses = c(4, 5, 3, 5, 4),
  completed = TRUE
)

# Accessing list elements
survey_result$name
#> [1] "Aisha"
survey_result[["age"]]
#> [1] 28
survey_result[[4]]         # Fourth element (responses)
#> [1] 4 5 3 5 4
length(survey_result)
#> [1] 5
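
The difference between single and double brackets matters more for lists than for any other object. The sketch below recreates survey_result, then shows that difference and how elements can be added or removed; mean_response is an illustrative name.

```r
survey_result <- list(
  respondent_id = 101,
  name = "Aisha",
  age = 28,
  responses = c(4, 5, 3, 5, 4),
  completed = TRUE
)

# Single brackets return a smaller list; double brackets return the element itself.
class(survey_result["age"])
#> [1] "list"
class(survey_result[["age"]])
#> [1] "numeric"

# Elements can be added after creation.
survey_result$mean_response <- mean(survey_result$responses)
survey_result$mean_response
#> [1] 4.2

# Assigning NULL removes an element.
survey_result$completed <- NULL
length(survey_result)
#> [1] 5
```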

1.4.5 Data Frames

The data frame is the workhorse of data analysis in R. It is a list of vectors of equal length — think of it as a spreadsheet or database table.

Code
# Create a data frame manually
crop_data <- data.frame(
  state    = c("Punjab", "Haryana", "UP", "MP", "Rajasthan"),
  crop     = c("Wheat", "Wheat", "Rice", "Soybean", "Mustard"),
  yield_qt = c(50.2, 47.8, 32.1, 12.5, 10.3),
  year     = rep(2023, 5)
)

print(crop_data)
#>       state    crop yield_qt year
#> 1    Punjab   Wheat     50.2 2023
#> 2   Haryana   Wheat     47.8 2023
#> 3        UP    Rice     32.1 2023
#> 4        MP Soybean     12.5 2023
#> 5 Rajasthan Mustard     10.3 2023
str(crop_data)       # Structure of the object
#> 'data.frame':    5 obs. of  4 variables:
#>  $ state   : chr  "Punjab" "Haryana" "UP" "MP" ...
#>  $ crop    : chr  "Wheat" "Wheat" "Rice" "Soybean" ...
#>  $ yield_qt: num  50.2 47.8 32.1 12.5 10.3
#>  $ year    : num  2023 2023 2023 2023 2023
nrow(crop_data)      # Number of rows
#> [1] 5
ncol(crop_data)      # Number of columns
#> [1] 4
names(crop_data)     # Column names
#> [1] "state"    "crop"     "yield_qt" "year"
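
Because a data frame is a list of equal-length vectors, list tools work on it column by column, and a new column is just a new vector of the right length. The sketch below recreates crop_data to stay self-contained; the above_mean column is an illustrative addition.

```r
crop_data <- data.frame(
  state    = c("Punjab", "Haryana", "UP", "MP", "Rajasthan"),
  crop     = c("Wheat", "Wheat", "Rice", "Soybean", "Mustard"),
  yield_qt = c(50.2, 47.8, 32.1, 12.5, 10.3),
  year     = rep(2023, 5)
)

# A data frame is a list whose elements are the columns.
is.list(crop_data)
#> [1] TRUE
sapply(crop_data, class)
#>       state        crop    yield_qt        year 
#> "character" "character"   "numeric"   "numeric" 

# A per-column summary.
mean(crop_data$yield_qt)
#> [1] 30.58

# New columns are computed from existing ones, one value per row.
crop_data$above_mean <- crop_data$yield_qt > mean(crop_data$yield_qt)
crop_data$above_mean
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
```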

Subsetting data frames:

Code
# Single column as vector
crop_data$yield_qt
#> [1] 50.2 47.8 32.1 12.5 10.3

# Single column as data frame
crop_data["state"]
#>       state
#> 1    Punjab
#> 2   Haryana
#> 3        UP
#> 4        MP
#> 5 Rajasthan

# Rows 1-3, all columns
crop_data[1:3, ]
#>     state  crop yield_qt year
#> 1  Punjab Wheat     50.2 2023
#> 2 Haryana Wheat     47.8 2023
#> 3      UP  Rice     32.1 2023

# Filter rows by condition
crop_data[crop_data$yield_qt > 30, ]
#>     state  crop yield_qt year
#> 1  Punjab Wheat     50.2 2023
#> 2 Haryana Wheat     47.8 2023
#> 3      UP  Rice     32.1 2023
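
Row filters and column selections can be combined in a single pair of brackets, and base R also offers subset() and order() for the same kind of work. A brief sketch, recreating crop_data so that it runs on its own:

```r
crop_data <- data.frame(
  state    = c("Punjab", "Haryana", "UP", "MP", "Rajasthan"),
  crop     = c("Wheat", "Wheat", "Rice", "Soybean", "Mustard"),
  yield_qt = c(50.2, 47.8, 32.1, 12.5, 10.3),
  year     = rep(2023, 5)
)

# Combine conditions with & and pick columns by name.
crop_data[crop_data$crop == "Wheat" & crop_data$yield_qt > 48, c("state", "yield_qt")]
#>    state yield_qt
#> 1 Punjab     50.2

# subset() expresses a filter without repeating the data frame name.
high_yield <- subset(crop_data, yield_qt > 30, select = c(state, crop))
nrow(high_yield)
#> [1] 3

# order() returns the row positions that sort the data frame.
sorted <- crop_data[order(crop_data$yield_qt), ]
sorted$state[1]
#> [1] "Rajasthan"
```

subset() is convenient interactively; the bracket form is generally the safer choice inside functions, as the documentation for subset() itself notes.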

1.4.6 Tibbles: The Modern Data Frame

The tibble (tbl_df) is the tidyverse’s improved version of the data frame. It prints more cleanly and behaves more predictably.

Code
library(tibble)

crop_tibble <- as_tibble(crop_data)
print(crop_tibble)
#> # A tibble: 5 × 4
#>   state     crop    yield_qt  year
#>   <chr>     <chr>      <dbl> <dbl>
#> 1 Punjab    Wheat       50.2  2023
#> 2 Haryana   Wheat       47.8  2023
#> 3 UP        Rice        32.1  2023
#> 4 MP        Soybean     12.5  2023
#> 5 Rajasthan Mustard     10.3  2023

# Tibbles never convert strings to factors
# (base data.frame() also stopped doing so by default in R 4.0.0)
# Tibbles print only the first 10 rows and as many columns as fit
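
One concrete example of "more predictable" behaviour is what happens when a single column is selected with [ , ]. The sketch below uses a shortened version of crop_data. The tibble part is wrapped in a check so that the example still runs where the tibble package is not installed; that guard is a convenience for this illustration, not something a normal script needs.

```r
crop_data <- data.frame(
  state    = c("Punjab", "Haryana", "UP"),
  yield_qt = c(50.2, 47.8, 32.1)
)

# A base data frame drops to a plain vector when one column is selected.
class(crop_data[, "state"])
#> [1] "character"

# The tibble part runs only if the package is installed.
if (requireNamespace("tibble", quietly = TRUE)) {
  crop_tibble <- tibble::as_tibble(crop_data)

  # A tibble returns another tibble from the same operation.
  print(class(crop_tibble[, "state"]))

  # A tibble is still a data frame, so base functions keep working.
  print(is.data.frame(crop_tibble))
}
```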

Table 1.2 summarises the main R object types:

Table 1.2: Summary of core R object types
Object       Dimensions           Types Allowed         Primary Use
vector       1D                   One                   Single variable
factor       1D                   One (integer-coded)   Categorical variable
matrix       2D                   One                   Linear algebra
list         1D (heterogeneous)   Many                  Flexible container
data.frame   2D                   One per column        Dataset
tibble       2D                   One per column        Modern dataset

1.5 Basic Operations and Arithmetic

R uses standard mathematical operators plus several important ones for data work:

Code
# Basic arithmetic
5 + 3       # Addition
#> [1] 8
10 - 4      # Subtraction
#> [1] 6
6 * 7       # Multiplication
#> [1] 42
15 / 4      # Division
#> [1] 3.75
15 %/% 4    # Integer division
#> [1] 3
15 %% 4     # Modulo (remainder)
#> [1] 3
2^10        # Exponentiation
#> [1] 1024

# Comparison operators
5 > 3
#> [1] TRUE
5 == 5
#> [1] TRUE
5 != 3
#> [1] TRUE
5 >= 5
#> [1] TRUE

# Logical operators
TRUE & FALSE    # AND
#> [1] FALSE
TRUE | FALSE    # OR
#> [1] TRUE
!TRUE           # NOT
#> [1] FALSE
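
The operators above also work on whole vectors, which is where they become useful for data work. The sketch below applies them to a vector of yields, introduces %in% for membership tests, and shows why decimal results should be compared with a tolerance. The variable name yields is illustrative.

```r
yields <- c(50.2, 47.8, 32.1, 12.5, 10.3)

# Operators work element by element on vectors.
yields > 30 & yields < 50
#> [1] FALSE  TRUE  TRUE FALSE FALSE

# %in% tests membership in a set of values.
c("Wheat", "Maize") %in% c("Wheat", "Rice", "Soybean")
#> [1]  TRUE FALSE

# Floating-point results should be compared with a tolerance, not ==.
0.1 + 0.2 == 0.3
#> [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))
#> [1] TRUE

# && and || expect single values and are meant for if () conditions.
if (length(yields) > 0 && mean(yields) > 30) "high average" else "low average"
#> [1] "high average"
```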

1.6 Getting Help

R has an excellent built-in help system:

Code
# Get help for a function
?mean
help("lm")

# Search help files
??regression
help.search("linear model")

# See examples for a function
example(plot)

# List objects whose names match a pattern (searches attached packages only)
# apropos("melt")
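
To find out where a function actually comes from, base R offers find() and environmentName(). The sketch below uses median(), which ships with R in the stats package, so nothing needs to be installed.

```r
# find() reports where an object lives on the search path.
find("median")
#> [1] "package:stats"

# environmentName() gives the namespace a function was defined in.
environmentName(environment(stats::median))
#> [1] "stats"

# apropos() matches object names by pattern; it does not report the package.
"median" %in% apropos("^med")
#> [1] TRUE
```

Both find() and apropos() look only at attached packages, so a function in a package that is installed but not loaded will not show up; ?? searches the documentation of all installed packages instead.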

1.7 Exercises

  1. Install the palmerpenguins package and load it. How many rows and columns does the penguins dataset have?

  2. Create a numeric vector called temperatures containing the average monthly temperatures (in Celsius) for a city of your choice. Calculate the mean, median, and standard deviation.

  3. Create a data frame with at least four columns describing five agricultural commodities (name, price, quantity produced, state of origin). Subset it to show only commodities produced in your home state.

  4. The species column in palmerpenguins::penguins is already stored as a factor. Confirm this with class(), then count how many penguins belong to each species.

  5. What is the difference between = and <- for assignment in R? When would you use one vs. the other?