| Pane | Position | Primary Purpose |
|---|---|---|
| Source Editor | Top-Left | Write and save scripts (.R, .qmd) |
| Console | Bottom-Left | Run code interactively |
| Environment/History | Top-Right | See objects in memory |
| Files/Plots/Help | Bottom-Right | Navigate files, view plots |
1 Introduction to the R Ecosystem
1.1 What Is R?
R is a programming language and environment designed specifically for statistical computing and data visualisation. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the early 1990s, and is today maintained by the R Core Team with contributions from thousands of developers worldwide.
R is free and open-source, distributed under the GNU General Public Licence. This means it costs nothing to use, and anyone can inspect, modify, or extend it.
1.1.1 R vs. Python, SAS, and SPSS
All four tools can perform statistical analysis. The differences matter, though:
| Feature | R | Python | SAS | SPSS |
|---|---|---|---|---|
| Cost | Free | Free | Expensive | Expensive |
| Statistical depth | ★★★★★ | ★★★★☆ | ★★★★★ | ★★★☆☆ |
| Visualisation | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Machine Learning | ★★★☆☆ | ★★★★★ | ★★★☆☆ | ★★☆☆☆ |
| Reproducibility | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Community (Stats) | ★★★★★ | ★★★☆☆ | ★★★☆☆ | ★★☆☆☆ |
The bottom line: If your work is primarily statistical analysis and research, R is the best tool. If you need deep machine learning or production deployment pipelines, Python is often the better choice. SAS and SPSS are legacy tools found mainly in industries with long-standing institutional inertia.
1.2 The R GUI vs. RStudio IDE
When you install R, you get a minimal graphical user interface (GUI) — essentially a console where you can type commands. This is functional but inconvenient for any serious work.
RStudio is an Integrated Development Environment (IDE) that wraps around R and makes it vastly more productive. The company that develops it was renamed from RStudio to Posit in 2022, but the IDE itself is still called RStudio. It is the tool used throughout this book.
1.2.1 The Four Panes of RStudio
RStudio organises itself into four panes (Figure 1.1):
Pane 1 — Source Editor (top-left): Where you write scripts. Files with the .R extension are plain R scripts; .qmd files are Quarto documents (covered in Chapter 2).
Pane 2 — Console (bottom-left): The live R session. Code runs here. You can also type directly into the console for quick experiments, but anything you want to keep should go in a script.
Pane 3 — Environment / History (top-right): Lists every object currently in R’s memory — data frames, vectors, models, and so on. The History tab shows every command you have run.
Pane 4 — Files / Plots / Packages / Help (bottom-right): A multi-purpose panel. Use Files to navigate your project directory, Plots to see graphics, Packages to manage installed packages, and Help to read documentation.
1.2.2 RStudio Projects
One of the most important habits to develop is using RStudio Projects. A project is simply a directory with a .Rproj file that tells RStudio where the “root” of your work is.
To create a new project: File → New Project → New Directory → New Project.
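Because a project fixes the root of your work, file paths in scripts can be built relative to that root rather than hard-coded. The following is a minimal sketch, not a prescribed layout: the `data` subfolder and the file name `yields.csv` are hypothetical, and a temporary directory stands in for the project root so the example runs anywhere.

```r
# A temporary directory stands in for the project root (assumption for illustration)
project_root <- file.path(tempdir(), "my_project")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Build the path from the root instead of hard-coding an absolute path
csv_path <- file.path(project_root, "data", "yields.csv")  # hypothetical file name
write.csv(data.frame(state = "Punjab", yield_qt = 50.2), csv_path, row.names = FALSE)

# Reading the file back uses the same root-relative construction
yields <- read.csv(csv_path)
nrow(yields)
#> [1] 1
```

When a project is opened through its .Rproj file, RStudio sets the working directory to the project folder, so a relative path such as file.path("data", "yields.csv") is usually enough.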
1.3 Installing and Managing Packages
Base R is powerful, but its real strength comes from the ecosystem of packages — collections of functions, data, and documentation that extend R’s capabilities.
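Before installing anything, it can help to see what is already available. The sketch below uses only functions from base R and the utils package; the check for `stats` is just an illustration, since that package ships with every R installation.

```r
# Directories where R looks for installed packages
.libPaths()

# Names of every installed package (base and recommended ones included)
installed <- rownames(installed.packages())
head(installed)

# Check for a specific package without loading it
"stats" %in% installed
#> [1] TRUE
```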
1.3.1 Installing Packages
```r
# Install a single package from CRAN
install.packages("ggplot2")

# Install multiple packages at once
install.packages(c("dplyr", "tidyr", "readr"))

# Install a development version from GitHub
# remotes::install_github("tidyverse/ggplot2")
```
1.3.2 Loading Packages
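Installing a package only places it on disk; it still has to be attached in each new session before its functions can be called by name. A minimal sketch follows, using the tools package that ships with R so that it runs without installing anything. The choice of package and the file name are for illustration only.

```r
# Attach a package for the current session
library(tools)

# Functions from the attached package can now be called directly
file_ext("analysis.qmd")
#> [1] "qmd"

# A single function can also be called without attaching, via package::function
tools::file_path_sans_ext("analysis.qmd")
#> [1] "analysis"

# requireNamespace() reports whether a package is installed, without attaching it
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
```

The package::function form is handy for one-off calls and makes it explicit where a function comes from.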
1.3.3 The Tidyverse Meta-Package
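The tidyverse is a curated collection of packages that share a common design philosophy. Installing and loading it attaches the core packages at once: nine as of tidyverse 2.0.0 (dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, and tidyr), or eight in earlier releases, which did not attach lubridate.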
```r
# Install once
install.packages("tidyverse")

# Load in every session that uses it
library(tidyverse)
```
1.3.4 Package Management with renv
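For serious projects, use renv to create a project-specific package library. It records the exact version of every package in a lockfile (renv.lock), so the same package versions can be reinstalled later. This makes it far more likely that your analysis will produce the same results years from now, although the R version and system libraries can still differ.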
```r
install.packages("renv")
renv::init()      # Initialise renv for the current project
renv::snapshot()  # Record current package versions
renv::restore()   # Reinstall packages from the snapshot
```
1.4 Introduction to R Objects
Everything in R is an object. Understanding the different types of objects is fundamental to using R effectively.
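One way to see this is to ask different values what they are. The sketch below collects a few arbitrary example objects (the names are illustrative) and inspects them with class() and typeof().

```r
# Each value below is an object with a class and an underlying storage type
objects <- list(
  number  = 3.14,
  text    = "R",
  flag    = TRUE,
  fn      = mean,                 # functions are objects too
  dataset = data.frame(x = 1:2)
)

# class() gives the high-level kind; typeof() gives the internal storage type
sapply(objects, function(obj) class(obj)[1])
sapply(objects, typeof)
```

Note that the data frame reports a storage type of "list", a point taken up in the section on data frames.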
1.4.1 Vectors
The most basic object in R is a vector — an ordered collection of values of the same type.
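The "same type" rule has a practical consequence: when values of different types are combined, R silently converts them to the most general type. A short sketch with made-up values:

```r
# Mixing types in c() coerces everything to the most general type
mixed_num_lgl <- c(1.5, TRUE, FALSE)   # logical -> numeric
mixed_num_chr <- c(1.5, "two", TRUE)   # everything -> character

class(mixed_num_lgl)
#> [1] "numeric"
mixed_num_lgl
#> [1] 1.5 1.0 0.0
class(mixed_num_chr)
#> [1] "character"
mixed_num_chr
#> [1] "1.5"  "two"  "TRUE"
```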
```r
# Numeric vector
heights <- c(165, 172, 158, 180, 169)
print(heights)
#> [1] 165 172 158 180 169

# Character vector
cities <- c("Delhi", "Mumbai", "Kolkata", "Chennai")
print(cities)
#> [1] "Delhi"   "Mumbai"  "Kolkata" "Chennai"

# Logical vector
above_170 <- heights > 170
print(above_170)
#> [1] FALSE  TRUE FALSE  TRUE FALSE

# Integer vector (note the L suffix)
counts <- c(1L, 5L, 3L, 7L, 2L)
class(counts)
#> [1] "integer"
```
Subsetting vectors uses square brackets []:
```r
# First element
heights[1]
#> [1] 165

# Elements 2 through 4
heights[2:4]
#> [1] 172 158 180

# Elements matching a condition
heights[heights > 170]
#> [1] 172 180

# Using a logical vector to subset
heights[above_170]
#> [1] 172 180
```
1.4.2 Factors
Factors represent categorical variables. They look like characters but store the data as integers internally, which is more memory-efficient and important for statistical modelling.
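The integer storage can be inspected directly. The following sketch uses a small made-up factor (the name `size` is illustrative) to show the underlying codes and how an ordered factor supports comparisons.

```r
size <- factor(
  c("Low", "High", "Medium", "Low"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# The underlying integer codes index into the levels
as.integer(size)
#> [1] 1 3 2 1
typeof(size)
#> [1] "integer"

# Because the factor is ordered, comparisons follow the level order
size >= "Medium"
#> [1] FALSE  TRUE  TRUE FALSE
```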
```r
income_group <- factor(
  c("Low", "Medium", "High", "Medium", "Low", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)
print(income_group)
#> [1] Low    Medium High   Medium Low    High
#> Levels: Low < Medium < High

levels(income_group)
#> [1] "Low"    "Medium" "High"

table(income_group)
#> income_group
#>    Low Medium   High
#>      2      2      2
```
1.4.3 Matrices
A matrix is a two-dimensional vector — all elements must be of the same type.
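Two details follow from that definition: values fill the matrix column by column unless told otherwise, and mixing types coerces the whole matrix. A brief sketch with illustrative names:

```r
# A matrix is a vector with a dim attribute; values fill column by column
v <- 1:6
m_col <- matrix(v, nrow = 2)                 # default: column-major fill
m_row <- matrix(v, nrow = 2, byrow = TRUE)   # fill row by row instead

dim(m_col)
#> [1] 2 3
m_col[1, ]
#> [1] 1 3 5
m_row[1, ]
#> [1] 1 2 3

# Mixing types coerces the whole matrix, just as with vectors
typeof(cbind(1:2, c("a", "b")))
#> [1] "character"
```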
```r
# Create a 3x3 matrix
m <- matrix(1:9, nrow = 3, ncol = 3)
print(m)
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9

# Matrix operations
m[2, 3]   # Row 2, column 3
#> [1] 8
m[1, ]    # Entire first row
#> [1] 1 4 7
m[, 2]    # Entire second column
#> [1] 4 5 6
t(m)      # Transpose
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
#> [3,]    7    8    9
m %*% m   # Matrix multiplication
#>      [,1] [,2] [,3]
#> [1,]   30   66  102
#> [2,]   36   81  126
#> [3,]   42   96  150
```
1.4.4 Lists
A list is a collection of objects that can be of different types — including other lists. Lists are the most flexible data structure in R.
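Nesting is where lists differ most from vectors. The sketch below builds a hypothetical record (all names and values are made up for illustration) containing another list, and contrasts single and double brackets.

```r
# A hypothetical nested record: a list holding another list
study <- list(
  title = "Household survey",
  waves = c(2021, 2022, 2023),
  lead  = list(name = "Aisha", age = 28)
)

# Chain $ or [[ ]] to reach into nested elements
study$lead$name
#> [1] "Aisha"
study[["waves"]][2]
#> [1] 2022

# Single brackets return a (shorter) list; double brackets return the element
class(study["waves"])
#> [1] "list"
class(study[["waves"]])
#> [1] "numeric"
```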
```r
survey_result <- list(
  respondent_id = 101,
  name = "Aisha",
  age = 28,
  responses = c(4, 5, 3, 5, 4),
  completed = TRUE
)

# Accessing list elements
survey_result$name
#> [1] "Aisha"
survey_result[["age"]]
#> [1] 28
survey_result[[4]]   # Fourth element (responses)
#> [1] 4 5 3 5 4
length(survey_result)
#> [1] 5
```
1.4.5 Data Frames
The data frame is the workhorse of data analysis in R. It is a list of vectors of equal length — think of it as a spreadsheet or database table.
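The "list of vectors of equal length" description can be checked directly. This is a small sketch with made-up values; the error message in the tryCatch() handler is paraphrased rather than R's exact wording.

```r
df <- data.frame(
  crop     = c("Wheat", "Rice", "Mustard"),
  yield_qt = c(50.2, 32.1, 10.3)
)

# A data frame is a list underneath: one element per column
is.list(df)
#> [1] TRUE
length(df)          # number of columns, not rows
#> [1] 2
sapply(df, class)   # each column keeps its own type

# Columns must have equal length; a mismatch is an error
result <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error: differing number of rows"
)
result
#> [1] "error: differing number of rows"
```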
```r
# Create a data frame manually
crop_data <- data.frame(
  state = c("Punjab", "Haryana", "UP", "MP", "Rajasthan"),
  crop = c("Wheat", "Wheat", "Rice", "Soybean", "Mustard"),
  yield_qt = c(50.2, 47.8, 32.1, 12.5, 10.3),
  year = rep(2023, 5)
)
print(crop_data)
#>       state    crop yield_qt year
#> 1    Punjab   Wheat     50.2 2023
#> 2   Haryana   Wheat     47.8 2023
#> 3        UP    Rice     32.1 2023
#> 4        MP Soybean     12.5 2023
#> 5 Rajasthan Mustard     10.3 2023

str(crop_data)    # Structure of the object
#> 'data.frame':    5 obs. of  4 variables:
#>  $ state   : chr  "Punjab" "Haryana" "UP" "MP" ...
#>  $ crop    : chr  "Wheat" "Wheat" "Rice" "Soybean" ...
#>  $ yield_qt: num  50.2 47.8 32.1 12.5 10.3
#>  $ year    : num  2023 2023 2023 2023 2023
nrow(crop_data)   # Number of rows
#> [1] 5
ncol(crop_data)   # Number of columns
#> [1] 4
names(crop_data)  # Column names
#> [1] "state"    "crop"     "yield_qt" "year"
```
Subsetting data frames:
```r
# Single column as vector
crop_data$yield_qt
#> [1] 50.2 47.8 32.1 12.5 10.3

# Single column as data frame
crop_data["state"]
#>       state
#> 1    Punjab
#> 2   Haryana
#> 3        UP
#> 4        MP
#> 5 Rajasthan

# Rows 1-3, all columns
crop_data[1:3, ]
#>     state  crop yield_qt year
#> 1  Punjab Wheat     50.2 2023
#> 2 Haryana Wheat     47.8 2023
#> 3      UP  Rice     32.1 2023

# Filter rows by condition
crop_data[crop_data$yield_qt > 30, ]
#>     state  crop yield_qt year
#> 1  Punjab Wheat     50.2 2023
#> 2 Haryana Wheat     47.8 2023
#> 3      UP  Rice     32.1 2023
```
1.4.6 Tibbles: The Modern Data Frame
The tibble (tbl_df) is the tidyverse’s improved version of the data frame. It prints more cleanly and behaves more predictably.
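One concrete instance of "more predictably" is single-column subsetting. The sketch below uses a small made-up data frame; the tibble part is guarded with requireNamespace() so the example still runs if the tibble package is not installed.

```r
df <- data.frame(state = c("Punjab", "UP"), yield_qt = c(50.2, 32.1))

# Base data frame: selecting one column with [ , ] silently drops to a vector
class(df[, "yield_qt"])
#> [1] "numeric"

# Tibble: the same subsetting always returns a tibble
# (guarded so the sketch still runs if tibble is not installed)
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(class(tb[, "yield_qt"]))
}
```

Code that always receives the same kind of object back is easier to reason about, especially inside functions where the number of selected columns varies.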
```r
library(tibble)

crop_tibble <- as_tibble(crop_data)
print(crop_tibble)
#> # A tibble: 5 × 4
#>   state     crop    yield_qt  year
#>   <chr>     <chr>      <dbl> <dbl>
#> 1 Punjab    Wheat       50.2  2023
#> 2 Haryana   Wheat       47.8  2023
#> 3 UP        Rice        32.1  2023
#> 4 MP        Soybean     12.5  2023
#> 5 Rajasthan Mustard     10.3  2023

# Tibbles never convert strings to factors by default
# Tibbles print only the first 10 rows and as many columns as fit
```
Table 1.2 summarises the main R object types:
| Object | Dimensions | Types Allowed | Primary Use |
|---|---|---|---|
| vector | 1D | One | Single variable |
| factor | 1D | One (integer-coded) | Categorical variable |
| matrix | 2D | One | Linear algebra |
| list | 1D (heterogeneous) | Many | Flexible container |
| data.frame | 2D | One per column | Dataset |
| tibble | 2D | One per column | Modern dataset |
1.5 Basic Operations and Arithmetic
R uses standard mathematical operators plus several important ones for data work:
```r
# Basic arithmetic
5 + 3      # Addition
#> [1] 8
10 - 4     # Subtraction
#> [1] 6
6 * 7      # Multiplication
#> [1] 42
15 / 4     # Division
#> [1] 3.75
15 %/% 4   # Integer division
#> [1] 3
15 %% 4    # Modulo (remainder)
#> [1] 3
2^10       # Exponentiation
#> [1] 1024

# Comparison operators
5 > 3
#> [1] TRUE
5 == 5
#> [1] TRUE
5 != 3
#> [1] TRUE
5 >= 5
#> [1] TRUE

# Logical operators
TRUE & FALSE   # AND
#> [1] FALSE
TRUE | FALSE   # OR
#> [1] TRUE
!TRUE          # NOT
#> [1] FALSE
```
1.6 Getting Help
R has an excellent built-in help system:
```r
# Get help for a function
?mean
help("lm")

# Search help files
??regression
help.search("linear model")

# See examples for a function
example(plot)

# List objects on the search path whose names contain "melt"
# apropos("melt")
# Find which attached package provides a function
# find("median")
```
1.7 Exercises
1. Install the palmerpenguins package and load it. How many rows and columns does the penguins dataset have?
2. Create a numeric vector called temperatures containing the average monthly temperatures (in Celsius) for a city of your choice. Calculate the mean, median, and standard deviation.
3. Create a data frame with at least four columns describing five agricultural commodities (name, price, quantity produced, state of origin). Subset it to show only commodities produced in your home state.
4. Convert the species column in palmerpenguins::penguins to a factor and count how many penguins belong to each species.
5. What is the difference between = and <- for assignment in R? When would you use one vs. the other?