# Data Import and Export {#sec-data-import}
```{r}
#| label: setup-ch3
#| include: false
library(tidyverse)
library(readxl)
```
::: {.callout-note}
## Learning Objectives
By the end of this chapter, you will be able to:
- Import data from CSV, Excel, and other flat-file formats
- Read data from databases using DBI
- Handle common data import problems (encoding, separators, missing values)
- Export data in multiple formats
- Save and load R objects using `.RDS` and `.RData`
:::
## Reading Flat Files {#sec-flat-files}
**Flat files** are text-based files where each row is a record and columns are separated by a delimiter (comma, tab, semicolon, etc.).
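To make this concrete, here is a three-record comma-separated table as raw text. readr can parse a literal string wrapped in `I()` (readr 2.0+), which is handy for small demonstrations without creating a file:

```{r}
#| label: flat-file-inline
library(readr)
# Three records, two delimited columns — the same structure a .csv file holds
csv_text <- "state,yield\nPunjab,4.2\nHaryana,3.9\nKerala,2.8"
read_csv(I(csv_text), show_col_types = FALSE)
```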
### CSV Files with `readr` {#sec-readr}
The `readr` package (part of the tidyverse) provides fast, consistent functions for reading delimited text files.
```{r}
#| label: readr-demo
#| eval: false
library(readr)
# Read a comma-separated file
df <- read_csv("data/survey_results.csv")
# Read with explicit options
df <- read_csv(
  "data/survey_results.csv",
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    score = col_double(),
    date = col_date(format = "%Y-%m-%d")
  ),
  na = c("", "NA", "N/A", "missing"),
  skip = 2,    # Skip first 2 rows
  n_max = 1000 # Read at most 1000 rows
)
```
The most important `read_csv()` arguments:
```{r}
#| label: tbl-readr-args
#| echo: false
#| tbl-cap: "Key arguments for `readr::read_csv()`"
args_df <- data.frame(
  Argument = c("file", "col_types", "col_names", "na", "skip", "n_max", "locale"),
  Purpose = c(
    "Path to the file or a URL",
    "Column type specification",
    "Column names (TRUE = first row, FALSE = no header, character vector = explicit names)",
    "Character vector of strings to treat as NA",
    "Number of rows to skip before reading",
    "Maximum number of rows to read",
    "Locale for numbers, dates, and encoding"
  )
)
knitr::kable(args_df)
```
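When a value cannot be parsed as the expected type, `read_csv()` substitutes `NA` and records what went wrong; `problems()` retrieves that log. A minimal sketch using inline data (the compact `col_types = "ii"` string requests two integer columns):

```{r}
#| label: readr-problems
# "oops" cannot be parsed as an integer, so readr records a problem
df_bad <- readr::read_csv(I("id,score\n1,90\n2,oops"), col_types = "ii")
readr::problems(df_bad) # a tibble describing each failed parse
```

Checking `problems()` after any import of messy real-world data is a cheap habit that catches silent `NA`s early.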
### Base R vs. readr
```{r}
#| label: base-vs-readr
# Base R — still valid, slower for large files
# df_base <- read.csv("data.csv", stringsAsFactors = FALSE)
# readr — faster, returns a tibble, better defaults
# df_readr <- readr::read_csv("data.csv")
# Key differences:
# - Before R 4.0, read.csv() converted strings to factors by default; since R 4.0 it does not
# - read_csv() never converts strings to factors
# - read_csv() is significantly faster for large files
# - read_csv() provides column type messages on import
# Create sample data to demonstrate
sample_data <- data.frame(
  name = c("Ravi", "Priya", "Suresh"),
  score = c(85.5, 92.1, 78.3),
  passed = c(TRUE, TRUE, FALSE),
  date = as.Date(c("2024-01-15", "2024-01-16", "2024-01-17"))
)
# Write to temp file and re-read
tmp <- tempfile(fileext = ".csv")
write_csv(sample_data, tmp)
reimported <- read_csv(tmp, show_col_types = FALSE)
print(reimported)
```
### Tab-Separated and Other Delimiters
```{r}
#| eval: false
# Tab-separated values
df_tsv <- read_tsv("data/results.tsv")
# Semicolon-separated (common in European data where comma is decimal separator)
df_semi <- read_csv2("data/results.csv") # Uses ; as separator, , as decimal
df_semi <- read_delim("data/results.csv", delim = ";")
# Fixed-width files
df_fwf <- read_fwf("data/fixed_width.txt",
                   fwf_widths(c(10, 8, 5), c("name", "date", "value")))
```
### URLs and Remote Files
```{r}
#| eval: false
# Read directly from a URL
df_web <- read_csv("https://raw.githubusercontent.com/datasets/wheat-prices/master/data/wheat-prices.csv")
# From a compressed file
df_gz <- read_csv("data/large_file.csv.gz")
```
## Reading Excel Files {#sec-excel}
Excel files are ubiquitous in government and corporate data. The `readxl` package handles them cleanly without requiring Excel to be installed.
```{r}
#| eval: false
library(readxl)
# Read the first sheet
df_excel <- read_excel("data/crop_production.xlsx")
# Specify a sheet by name or position
df_sheet2 <- read_excel("data/crop_production.xlsx", sheet = "Punjab")
df_sheet2 <- read_excel("data/crop_production.xlsx", sheet = 2)
# Read a specific cell range
df_range <- read_excel("data/crop_production.xlsx",
                       sheet = "Summary",
                       range = "A1:D50")
# Skip rows and handle custom NAs
df_clean <- read_excel(
  "data/survey.xlsx",
  skip = 3,
  na = c("", "N/A", "-", ".")
)
# List all sheet names
excel_sheets("data/crop_production.xlsx")
```
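A common follow-up is importing every sheet of a workbook at once by looping over `excel_sheets()`. A sketch, assuming the `crop_production.xlsx` file from the examples above:

```{r}
#| label: read-all-sheets
#| eval: false
path <- "data/crop_production.xlsx"
sheets <- excel_sheets(path)
# Read each sheet into a named list of data frames
all_sheets <- lapply(sheets, function(s) read_excel(path, sheet = s))
names(all_sheets) <- sheets
all_sheets[["Punjab"]] # Access any sheet by name
```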
::: {.callout-tip}
## When Excel Dates Appear as Numbers
Excel (on Windows) stores dates as serial day numbers, with day 1 being January 1, 1900. Because Excel also counts a non-existent February 29, 1900, the correct origin for conversion in R is `1899-12-30`. If your date column imports as numeric, use:
```r
df$date_column <- as.Date(df$date_column, origin = "1899-12-30")
```
:::
## Connecting to Databases {#sec-databases}
For large datasets, it is more efficient to query a database than to load everything into memory.
```{r}
#| eval: false
library(DBI)
library(RSQLite) # Or RPostgres, RMySQL, odbc, etc.
# Connect to a SQLite database
con <- dbConnect(RSQLite::SQLite(), "data/my_database.sqlite")
# List tables
dbListTables(con)
# Query a table
df_query <- dbGetQuery(con, "SELECT * FROM crop_data WHERE year = 2023")
# Use dplyr syntax with dbplyr (lazy evaluation — SQL runs in the database)
library(dbplyr)
crop_tbl <- tbl(con, "crop_data")
result <- crop_tbl |>
  filter(year == 2023, state == "Punjab") |>
  select(crop, yield, area) |>
  collect() # Pull results into R memory
# Always close the connection
dbDisconnect(con)
```
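Two details worth knowing beyond the queries above: `dbWriteTable()` copies a data frame into the database, and `dbGetQuery()` accepts a `params` argument so values are bound safely rather than pasted into the SQL string. A sketch using an in-memory SQLite database, so nothing touches disk:

```{r}
#| label: dbi-params
#| eval: false
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Copy a data frame into the database as a table
dbWriteTable(con, "mtcars", mtcars)
# Parameterised query: the ? placeholder is bound from params
heavy <- dbGetQuery(con, "SELECT * FROM mtcars WHERE wt > ?", params = list(4))
nrow(heavy)
dbDisconnect(con)
```

Parameterised queries matter whenever the filter value comes from user input, since pasting strings into SQL invites both bugs and injection attacks.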
## Reading Other Formats {#sec-other-formats}
```{r}
#| eval: false
# SPSS files
library(haven)
df_spss <- read_spss("data/survey.sav")
# Stata files
df_stata <- read_stata("data/panel_data.dta")
# SAS files
df_sas <- read_sas("data/clinical_trial.sas7bdat")
# JSON
library(jsonlite)
df_json <- fromJSON("data/api_response.json", flatten = TRUE)
# Google Sheets (requires authentication)
library(googlesheets4)
gs4_auth()
df_gsheet <- read_sheet("https://docs.google.com/spreadsheets/d/YOUR_ID")
```
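One quirk of SPSS and Stata files: categorical variables arrive as `labelled` columns (numeric codes plus value labels), not as factors. haven's `as_factor()` converts them. A minimal sketch with a hand-built labelled vector (the values here are hypothetical, for illustration only):

```{r}
#| label: haven-labelled
#| eval: false
library(haven)
# A labelled vector like those read_spss()/read_stata() return
x <- labelled(c(1, 2, 1, 3), labels = c(Low = 1, Medium = 2, High = 3))
as_factor(x) # converts the value labels into factor levels
```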
## Writing Data {#sec-writing}
### Writing Flat Files
```{r}
#| eval: false
# Write CSV (tidyverse way)
write_csv(df, "output/results.csv")
# Write with NA representation
write_csv(df, "output/results.csv", na = "")
# Write TSV
write_tsv(df, "output/results.tsv")
# Base R equivalent
write.csv(df, "output/results.csv", row.names = FALSE)
```
### Writing Excel Files
```{r}
#| eval: false
library(writexl)
# Write a single data frame to Excel
write_xlsx(df, "output/results.xlsx")
# Write multiple data frames as separate sheets
write_xlsx(
  list(
    "Summary" = summary_df,
    "Raw Data" = raw_df,
    "Models" = model_results_df
  ),
  "output/full_report.xlsx"
)
```
## Saving R Objects {#sec-rds-rdata}
Sometimes you want to save R objects — models, processed datasets, or lists — in a format that preserves all R-specific attributes.
### `.RDS` Files — Single Objects
**`.RDS`** (R Data Serialisation) stores a *single* R object:
```{r}
#| label: rds-demo
#| eval: false
# Save
model_fit <- lm(mpg ~ wt + hp, data = mtcars)
saveRDS(model_fit, "output/model_fit.rds")
# Load — assign to any name you want
m <- readRDS("output/model_fit.rds")
summary(m)
# RDS is ideal for:
# - Saving trained models
# - Caching slow computations
# - Passing objects between scripts
```
### `.RData` Files — Multiple Objects
**`.RData`** stores *multiple* named objects from your workspace:
```{r}
#| eval: false
x <- 1:10
y <- "hello"
df <- mtcars
# Save multiple objects
save(x, y, df, file = "output/my_workspace.RData")
# Save entire workspace
save.image("output/full_workspace.RData")
# Load — objects are restored with their original names
load("output/my_workspace.RData") # x, y, df are now in your environment
```
::: {.callout-warning}
## `.RDS` vs. `.RData`
Prefer `.RDS` over `.RData` for most purposes because:
- `.RDS` is explicit: you choose the name when loading (`m <- readRDS(...)`)
- `.RData` silently overwrites existing objects with the same name on load
- `.RDS` is easier to use in pipelines and package code
:::
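You can verify the round-trip yourself: an object written with `saveRDS()` and read back with `readRDS()` is identical to the original, attributes and all:

```{r}
#| label: rds-roundtrip
tmp <- tempfile(fileext = ".rds")
saveRDS(mtcars, tmp)
identical(readRDS(tmp), mtcars) # TRUE
unlink(tmp)
```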
## The `here` Package: Robust File Paths {#sec-here}
Hard-coded file paths break when you move your project or collaborate with others. The `here` package solves this by always building paths relative to the **project root**.
```{r}
#| label: here-demo
#| eval: false
library(here)
# here() finds the project root (where the .Rproj file lives)
here()
# Build a path relative to the project root
here("data", "survey_results.csv") # Works on Windows, Mac, and Linux
here("output", "figures", "plot1.png")
```
::: {.callout-tip}
## The Golden Rule of File Paths
**Never use `setwd()` in a script.** It makes your script non-portable. Instead, use RStudio Projects and the `here` package for all file paths.
:::
## Data Import Workflow {#sec-import-workflow}
A typical data import workflow looks like this:
```{r}
#| label: import-workflow
#| eval: false
library(tidyverse)
library(here)
library(janitor) # For clean_names()
# 1. Import
raw <- read_csv(
  here("data", "raw", "crop_yields_2023.csv"),
  na = c("", "NA", "-", ".."),
  show_col_types = FALSE
)
# 2. Inspect
glimpse(raw)
skimr::skim(raw)
# 3. Clean column names to snake_case
df <- raw |>
  janitor::clean_names() # "Total Area (ha)" → "total_area_ha"
# 4. Save the cleaned version
write_csv(df, here("data", "processed", "crop_yields_2023_clean.csv"))
saveRDS(df, here("data", "processed", "crop_yields_2023_clean.rds"))
```
## Exercises {#sec-ch3-exercises}
1. Download a CSV file of your choice from the Indian government's open data portal ([data.gov.in](https://data.gov.in)) and import it using `read_csv()`. Report on how many rows and columns it has, and what column types were inferred.
2. Practice handling import problems: create a CSV file in a text editor where the decimal separator is a comma and the column separator is a semicolon. Import it using the appropriate `readr` function.
3. Save the `penguins` dataset (from `palmerpenguins`) to an RDS file. Then load it back in a fresh R session and verify it is identical to the original using `identical()`.
4. Use the `here` package to set up a consistent file path structure for a hypothetical project with `data/raw/`, `data/processed/`, and `output/` directories.
5. **Challenge:** Scrape or download monthly petrol price data for India (e.g., from the PPAC website). Import it and write a brief summary report in Quarto showing price trends.