3  Data Import and Export

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Import data from CSV, Excel, and other flat-file formats
  • Read data from databases using DBI
  • Handle common data import problems (encoding, separators, missing values)
  • Export data in multiple formats
  • Save and load R objects using .RDS and .RData

3.1 Reading Flat Files

Flat files are text-based files where each row is a record and columns are separated by a delimiter (comma, tab, semicolon, etc.).
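
To make the delimiter idea concrete, here is a tiny comma-separated file parsed from an inline string. Base R's read.csv() accepts literal text through its text argument, so no file on disk is needed for the demo:

```r
# A two-record CSV as a literal string: the first row is the header
csv_text <- "name,score\nRavi,85.5\nPriya,92.1"

# read.csv() can parse text directly instead of a file path
df <- read.csv(text = csv_text)

nrow(df)     # 2
names(df)    # "name" "score"
```

Each newline marks a record and each comma marks a column boundary; that is all a flat file is.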

3.1.1 CSV Files with readr

The readr package (part of the tidyverse) provides fast, consistent functions for reading delimited text files.

Code
library(readr)

# Read a comma-separated file
df <- read_csv("data/survey_results.csv")

# Read with explicit options
df <- read_csv(
  "data/survey_results.csv",
  col_types  = cols(
    id       = col_integer(),
    name     = col_character(),
    score    = col_double(),
    date     = col_date(format = "%Y-%m-%d")
  ),
  na         = c("", "NA", "N/A", "missing"),
  skip       = 2,          # Skip first 2 rows
  n_max      = 1000        # Read at most 1000 rows
)

The most important read_csv() arguments:

Table 3.1: Key arguments for readr::read_csv()

  Argument    Purpose
  ----------  ------------------------------------------------------------
  file        Path to the file or a URL
  col_types   Column type specification
  col_names   Column names (TRUE = first row, FALSE = no header, character vector = explicit names)
  na          Character vector of strings to treat as NA
  skip        Number of rows to skip before reading
  n_max       Maximum number of rows to read
  locale      Locale for numbers, dates, and encoding

3.1.2 Base R vs. readr

Code
# Base R — still valid, slower for large files
# df_base <- read.csv("data.csv", stringsAsFactors = FALSE)

# readr — faster, returns a tibble, better defaults
# df_readr <- readr::read_csv("data.csv")

# Key differences:
# - read.csv() converts strings to factors by default (unless stringsAsFactors = FALSE)
# - read_csv() never converts strings to factors
# - read_csv() is significantly faster for large files
# - read_csv() provides column type messages on import

# Create sample data to demonstrate
sample_data <- data.frame(
  name    = c("Ravi", "Priya", "Suresh"),
  score   = c(85.5, 92.1, 78.3),
  passed  = c(TRUE, TRUE, FALSE),
  date    = as.Date(c("2024-01-15", "2024-01-16", "2024-01-17"))
)

# Write to temp file and re-read
tmp <- tempfile(fileext = ".csv")
write_csv(sample_data, tmp)
reimported <- read_csv(tmp, show_col_types = FALSE)
print(reimported)
#> # A tibble: 3 × 4
#>   name   score passed date      
#>   <chr>  <dbl> <lgl>  <date>    
#> 1 Ravi    85.5 TRUE   2024-01-15
#> 2 Priya   92.1 TRUE   2024-01-16
#> 3 Suresh  78.3 FALSE  2024-01-17

3.1.3 Tab-Separated and Other Delimiters

Code
# Tab-separated values
df_tsv <- read_tsv("data/results.tsv")

# Semicolon-separated (common in European data where comma is decimal separator)
df_semi <- read_csv2("data/results.csv")     # Uses ; as separator, , as decimal
df_semi <- read_delim("data/results.csv", delim = ";",
                      locale = locale(decimal_mark = ","))   # Equivalent, spelled out

# Fixed-width files
df_fwf <- read_fwf("data/fixed_width.txt",
                   fwf_widths(c(10, 8, 5), c("name", "date", "value")))
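
Base R has an equivalent, read.fwf() from utils, which also works on inline text. The column widths below (5 characters for the name, 3 for the value) are made up for the demo:

```r
# Two fixed-width records: columns are 5 and 3 characters wide
txt <- textConnection(c("Ravi  85",
                        "Priya 92"))

df_fwf <- read.fwf(txt, widths = c(5, 3),
                   col.names = c("name", "value"))

df_fwf$value    # 85 92
```

Fixed-width formats carry no delimiter at all; the width specification is the only thing telling the parser where one column ends and the next begins, so getting the widths right is essential.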

3.1.4 URLs and Remote Files

Code
# Read directly from a URL
df_web <- read_csv("https://raw.githubusercontent.com/datasets/wheat-prices/master/data/wheat-prices.csv")

# From a compressed file
df_gz <- read_csv("data/large_file.csv.gz")

3.2 Reading Excel Files

Excel files are ubiquitous in government and corporate data. The readxl package handles them cleanly without requiring Excel to be installed.

Code
library(readxl)

# Read the first sheet
df_excel <- read_excel("data/crop_production.xlsx")

# Specify a sheet by name or position
df_sheet2 <- read_excel("data/crop_production.xlsx", sheet = "Punjab")
df_sheet2 <- read_excel("data/crop_production.xlsx", sheet = 2)

# Read a specific cell range
df_range <- read_excel("data/crop_production.xlsx",
                       sheet = "Summary",
                       range = "A1:D50")

# Skip rows and handle custom NAs
df_clean <- read_excel(
  "data/survey.xlsx",
  skip = 3,
  na   = c("", "N/A", "-", ".")
)

# List all sheet names
excel_sheets("data/crop_production.xlsx")
Tip: When Excel Dates Appear as Numbers

Excel stores dates as a serial number of days counted from its 1900 epoch. If your date column imports as numeric, convert it with the origin "1899-12-30" (the two-day offset compensates for Excel treating the nonexistent 29 February 1900 as a real day):

df$date_column <- as.Date(df$date_column, origin = "1899-12-30")
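
As a concrete check (the serial number below is just a worked example, not from any particular file):

```r
# Excel serial 45292 decodes to 1 January 2024
as.Date(45292, origin = "1899-12-30")
#> [1] "2024-01-01"
```

If your converted dates come out exactly one or two days off, the origin is almost always the culprit.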

3.3 Connecting to Databases

For large datasets, it is more efficient to query a database than to load everything into memory.

Code
library(DBI)
library(RSQLite)     # Or RPostgres, RMySQL, odbc, etc.

# Connect to a SQLite database
con <- dbConnect(RSQLite::SQLite(), "data/my_database.sqlite")

# List tables
dbListTables(con)

# Query a table
df_query <- dbGetQuery(con, "SELECT * FROM crop_data WHERE year = 2023")

# Use dplyr verbs with dbplyr (lazy evaluation: SQL runs in the database)
library(dplyr)     # Provides filter(), select(), collect()
library(dbplyr)
crop_tbl <- tbl(con, "crop_data")

result <- crop_tbl |>
  filter(year == 2023, state == "Punjab") |>
  select(crop, yield, area) |>
  collect()   # Pull results into R memory

# Always close the connection
dbDisconnect(con)
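
When a query needs a value supplied at run time, pass it through the params argument rather than pasting it into the SQL string; this avoids both quoting bugs and SQL injection. The sketch below assumes RSQLite is installed and uses an in-memory database so nothing touches disk:

```r
library(DBI)
library(RSQLite)

# An in-memory SQLite database, handy for experiments
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "crop_data",
             data.frame(year = c(2022, 2023), yield = c(4.1, 4.4)))

# "?" is a placeholder; the value is bound separately via params
df_param <- dbGetQuery(con, "SELECT * FROM crop_data WHERE year = ?",
                       params = list(2023))
df_param$yield
#> [1] 4.4

dbDisconnect(con)
```

The same params mechanism works with the other DBI backends (RPostgres, odbc, etc.), though the placeholder syntax can differ by database.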

3.4 Reading Other Formats

Code
# SPSS files
library(haven)
df_spss <- read_spss("data/survey.sav")

# Stata files
df_stata <- read_stata("data/panel_data.dta")

# SAS files
df_sas <- read_sas("data/clinical_trial.sas7bdat")

# JSON
library(jsonlite)
df_json <- fromJSON("data/api_response.json", flatten = TRUE)

# Google Sheets (requires authentication)
library(googlesheets4)
gs4_auth()
df_gsheet <- read_sheet("https://docs.google.com/spreadsheets/d/YOUR_ID")

3.5 Writing Data

3.5.1 Writing Flat Files

Code
# Write CSV (tidyverse way)
write_csv(df, "output/results.csv")

# Write with NA representation
write_csv(df, "output/results.csv", na = "")

# Write TSV
write_tsv(df, "output/results.tsv")

# Base R equivalent
write.csv(df, "output/results.csv", row.names = FALSE)

3.5.2 Writing Excel Files

Code
library(writexl)

# Write a single data frame to Excel
write_xlsx(df, "output/results.xlsx")

# Write multiple data frames as separate sheets
write_xlsx(
  list(
    "Summary"  = summary_df,
    "Raw Data" = raw_df,
    "Models"   = model_results_df
  ),
  "output/full_report.xlsx"
)

3.6 Saving R Objects

Sometimes you want to save R objects — models, processed datasets, or lists — in a format that preserves all R-specific attributes.

3.6.1 .RDS Files — Single Objects

.RDS (R Data Serialisation) stores a single R object:

Code
# Save
model_fit <- lm(mpg ~ wt + hp, data = mtcars)
saveRDS(model_fit, "output/model_fit.rds")

# Load — assign to any name you want
m <- readRDS("output/model_fit.rds")
summary(m)

# RDS is ideal for:
# - Saving trained models
# - Caching slow computations
# - Passing objects between scripts
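
A quick round trip shows that readRDS() recovers the object exactly; a temporary file is used here so nothing is left behind:

```r
original <- list(a = 1:5, b = letters[1:3])

# Save to a throwaway file and read it back under a new name
path <- tempfile(fileext = ".rds")
saveRDS(original, path)
restored <- readRDS(path)

identical(original, restored)
#> [1] TRUE
```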

3.6.2 .RData Files — Multiple Objects

.RData stores multiple named objects from your workspace:

Code
x <- 1:10
y <- "hello"
df <- mtcars

# Save multiple objects
save(x, y, df, file = "output/my_workspace.RData")

# Save entire workspace
save.image("output/full_workspace.RData")

# Load — objects are restored with their original names
load("output/my_workspace.RData")   # x, y, df are now in your environment
Warning: .RDS vs. .RData

Prefer .RDS over .RData for most purposes because:

  • .RDS is explicit: you choose the name when loading (m <- readRDS(...))
  • .RData silently overwrites existing objects with the same name on load
  • .RDS is easier to use in pipelines and package code
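
The silent-overwrite hazard is easy to demonstrate:

```r
x <- "my current work"

# Elsewhere, a different x was saved to an .RData file
path <- tempfile(fileext = ".RData")
local({
  x <- 42
  save(x, file = path)
})

load(path)   # Silently replaces the x in your environment
x
#> [1] 42
```

With readRDS() this cannot happen, because you choose the destination name yourself at load time.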

3.7 The here Package: Robust File Paths

Hard-coded file paths break when you move your project or collaborate with others. The here package solves this by always building paths relative to the project root.

Code
library(here)

# here() finds the project root (where the .Rproj file lives)
here()
#> [1] "/home/pawan/statistics-with-R"

# Build a path relative to the project root
here("data", "survey_results.csv")   # Works on Windows, Mac, and Linux
#> [1] "/home/pawan/statistics-with-R/data/survey_results.csv"
here("output", "figures", "plot1.png")
#> [1] "/home/pawan/statistics-with-R/output/figures/plot1.png"
Tip: The Golden Rule of File Paths

Never use setwd() in a script. It makes your script non-portable. Instead, use RStudio Projects and the here package for all file paths.

3.8 Data Import Workflow

A typical data import workflow looks like this:

Code
library(tidyverse)
library(here)
library(janitor)   # For clean_names()

# 1. Import
raw <- read_csv(
  here("data", "raw", "crop_yields_2023.csv"),
  na             = c("", "NA", "-", ".."),
  show_col_types = FALSE
)

# 2. Inspect
glimpse(raw)
skimr::skim(raw)

# 3. Clean column names to snake_case
df <- raw |>
  janitor::clean_names()   # "Total Area (ha)" → "total_area_ha"

# 4. Save the cleaned version
write_csv(df, here("data", "processed", "crop_yields_2023_clean.csv"))
saveRDS(df, here("data", "processed", "crop_yields_2023_clean.rds"))
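
If janitor is not available, a rough base-R approximation of clean_names() can be sketched; it covers the common cases (spaces, punctuation, mixed case) but not janitor's full rule set:

```r
# Lowercase names, collapse runs of non-alphanumerics to "_",
# and trim any leading/trailing "_"
clean_names_basic <- function(x) {
  nm <- tolower(names(x))
  nm <- gsub("[^a-z0-9]+", "_", nm)
  nm <- gsub("^_|_$", "", nm)
  names(x) <- nm
  x
}

messy <- data.frame(`Total Area (ha)` = 1:3, check.names = FALSE)
names(clean_names_basic(messy))
#> [1] "total_area_ha"
```

In real projects prefer janitor::clean_names(), which also handles duplicated names, leading digits, and non-ASCII characters.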

3.9 Exercises

  1. Download a CSV file of your choice from the Indian government’s open data portal (data.gov.in) and import it using read_csv(). Report on how many rows and columns it has, and what column types were inferred.

  2. Practice handling import problems: create a CSV file in a text editor where the decimal separator is a comma and the column separator is a semicolon. Import it using the appropriate readr function.

  3. Save the penguins dataset (from palmerpenguins) to an RDS file. Then load it back in a fresh R session and verify it is identical to the original using identical().

  4. Use the here package to set up a consistent file path structure for a hypothetical project with data/raw/, data/processed/, and output/ directories.

  5. Challenge: Scrape or download monthly petrol price data for India (e.g., from the PPAC website). Import it and write a brief summary report in Quarto showing price trends.