4  Data Manipulation with dplyr and tidyr

NoteLearning Objectives

By the end of this chapter, you will be able to:

  • Understand the philosophy of tidy data
  • Use the six core dplyr verbs fluently
  • Chain operations with the pipe operator (|>)
  • Perform grouped summaries with group_by() and summarise()
  • Reshape data between wide and long formats with tidyr
  • Use data.table for high-performance operations on large datasets

4.1 The Tidyverse Philosophy: Tidy Data

Tidy data is a consistent way of organising tabular data that makes analysis much easier. Hadley Wickham formalised the concept in his 2014 paper. The three rules of tidy data are:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Most real-world data violates at least one of these rules. The tidyverse is designed to make it easy to tidy messy data.

Code
# Messy: crop yields spread across year columns
messy_yield <- tibble(
  state   = c("Punjab", "Haryana", "UP"),
  crop    = c("Wheat", "Wheat", "Rice"),
  `2021`  = c(48.2, 45.1, 30.5),
  `2022`  = c(49.5, 46.3, 31.2),
  `2023`  = c(50.1, 47.8, 32.1)
)
print(messy_yield)
#> # A tibble: 3 × 5
#>   state   crop  `2021` `2022` `2023`
#>   <chr>   <chr>  <dbl>  <dbl>  <dbl>
#> 1 Punjab  Wheat   48.2   49.5   50.1
#> 2 Haryana Wheat   45.1   46.3   47.8
#> 3 UP      Rice    30.5   31.2   32.1

# Tidy: each row is one observation (state × crop × year)
tidy_yield <- messy_yield |>
  pivot_longer(
    cols      = `2021`:`2023`,
    names_to  = "year",
    values_to = "yield_qtl_ha"
  ) |>
  mutate(year = as.integer(year))
print(tidy_yield)
#> # A tibble: 9 × 4
#>   state   crop   year yield_qtl_ha
#>   <chr>   <chr> <int>        <dbl>
#> 1 Punjab  Wheat  2021         48.2
#> 2 Punjab  Wheat  2022         49.5
#> 3 Punjab  Wheat  2023         50.1
#> 4 Haryana Wheat  2021         45.1
#> 5 Haryana Wheat  2022         46.3
#> 6 Haryana Wheat  2023         47.8
#> 7 UP      Rice   2021         30.5
#> 8 UP      Rice   2022         31.2
#> 9 UP      Rice   2023         32.1

4.2 The Pipe Operator

The pipe operator |> (base R, version 4.1+) passes the result on its left as the first argument to the function on its right. This enables readable, left-to-right chains of operations.

Code
# Without pipe: nested, reads inside-out
result <- round(mean(sqrt(c(4, 9, 16, 25))), 2)

# With pipe: linear, reads left-to-right
result <- c(4, 9, 16, 25) |>
  sqrt() |>
  mean() |>
  round(2)

print(result)
#> [1] 3.5
Tip|> vs %>%

The |> is the native R pipe introduced in R 4.1.0. The %>% is the magrittr pipe (from the magrittr package, loaded with dplyr). They are nearly identical for most uses. This book uses |> throughout to avoid loading an extra dependency.

The main difference: |> always passes to the first argument; %>% allows using . as a placeholder for other positions.

4.3 Core dplyr Verbs

dplyr provides six core verbs that cover the vast majority of data manipulation needs.

4.3.1 filter() — Subset Rows

Code
data(gapminder, package = "gapminder")

# Single condition
india <- gapminder |>
  filter(country == "India")

# Multiple conditions (AND)
south_asia_2007 <- gapminder |>
  filter(continent == "Asia", year == 2007, lifeExp > 70)

# OR condition
india_china <- gapminder |>
  filter(country %in% c("India", "China"))

# Negation
not_africa <- gapminder |>
  filter(continent != "Africa")

# Filter with string matching
# filter(str_detect(country, "^I"))   # Countries starting with I

nrow(india)
#> [1] 12
nrow(south_asia_2007)
#> [1] 22

4.3.2 select() — Subset Columns

Code
# Select by name
gapminder |>
  select(country, year, lifeExp) |>
  head(4)
#> # A tibble: 4 × 3
#>   country      year lifeExp
#>   <fct>       <int>   <dbl>
#> 1 Afghanistan  1952    28.8
#> 2 Afghanistan  1957    30.3
#> 3 Afghanistan  1962    32.0
#> 4 Afghanistan  1967    34.0

# Select a range
gapminder |>
  select(country:lifeExp) |>
  head(4)
#> # A tibble: 4 × 4
#>   country     continent  year lifeExp
#>   <fct>       <fct>     <int>   <dbl>
#> 1 Afghanistan Asia       1952    28.8
#> 2 Afghanistan Asia       1957    30.3
#> 3 Afghanistan Asia       1962    32.0
#> 4 Afghanistan Asia       1967    34.0

# Drop columns with -
gapminder |>
  select(-continent) |>
  head(4)
#> # A tibble: 4 × 5
#>   country      year lifeExp      pop gdpPercap
#>   <fct>       <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan  1952    28.8  8425333      779.
#> 2 Afghanistan  1957    30.3  9240934      821.
#> 3 Afghanistan  1962    32.0 10267083      853.
#> 4 Afghanistan  1967    34.0 11537966      836.

# Helper functions
gapminder |>
  select(starts_with("c")) |>
  head(4)
#> # A tibble: 4 × 2
#>   country     continent
#>   <fct>       <fct>    
#> 1 Afghanistan Asia     
#> 2 Afghanistan Asia     
#> 3 Afghanistan Asia     
#> 4 Afghanistan Asia

gapminder |>
  select(ends_with("p")) |>
  head(4)
#> # A tibble: 4 × 3
#>   lifeExp      pop gdpPercap
#>     <dbl>    <int>     <dbl>
#> 1    28.8  8425333      779.
#> 2    30.3  9240934      821.
#> 3    32.0 10267083      853.
#> 4    34.0 11537966      836.

gapminder |>
  select(where(is.numeric)) |>
  head(4)
#> # A tibble: 4 × 4
#>    year lifeExp      pop gdpPercap
#>   <int>   <dbl>    <int>     <dbl>
#> 1  1952    28.8  8425333      779.
#> 2  1957    30.3  9240934      821.
#> 3  1962    32.0 10267083      853.
#> 4  1967    34.0 11537966      836.

4.3.3 mutate() — Create or Modify Columns

Code
gapminder_ext <- gapminder |>
  mutate(
    gdp_total      = gdpPercap * pop,               # New column
    log_gdp        = log(gdpPercap),                # Log transformation
    lifeExp_decade = lifeExp %/% 10 * 10,           # Decade bins
    high_income    = gdpPercap > median(gdpPercap), # Logical column
    pop_millions   = pop / 1e6                      # Rescale
  )

gapminder_ext |>
  select(country, year, gdpPercap, gdp_total, high_income) |>
  head(5)
#> # A tibble: 5 × 5
#>   country      year gdpPercap   gdp_total high_income
#>   <fct>       <int>     <dbl>       <dbl> <lgl>      
#> 1 Afghanistan  1952      779. 6567086330. FALSE      
#> 2 Afghanistan  1957      821. 7585448670. FALSE      
#> 3 Afghanistan  1962      853. 8758855797. FALSE      
#> 4 Afghanistan  1967      836. 9648014150. FALSE      
#> 5 Afghanistan  1972      740. 9678553274. FALSE

4.3.4 arrange() — Sort Rows

Code
# Ascending (default)
gapminder |>
  filter(year == 2007) |>
  arrange(lifeExp) |>
  head(5)
#> # A tibble: 5 × 6
#>   country      continent  year lifeExp      pop gdpPercap
#>   <fct>        <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Swaziland    Africa     2007    39.6  1133066     4513.
#> 2 Mozambique   Africa     2007    42.1 19951656      824.
#> 3 Zambia       Africa     2007    42.4 11746035     1271.
#> 4 Sierra Leone Africa     2007    42.6  6144562      863.
#> 5 Lesotho      Africa     2007    42.6  2012649     1569.

# Descending
gapminder |>
  filter(year == 2007) |>
  arrange(desc(gdpPercap)) |>
  head(5)
#> # A tibble: 5 × 6
#>   country       continent  year lifeExp       pop gdpPercap
#>   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
#> 1 Norway        Europe     2007    80.2   4627926    49357.
#> 2 Kuwait        Asia       2007    77.6   2505559    47307.
#> 3 Singapore     Asia       2007    80.0   4553009    47143.
#> 4 United States Americas   2007    78.2 301139947    42952.
#> 5 Ireland       Europe     2007    78.9   4109086    40676.

# Multiple sort keys
gapminder |>
  arrange(continent, desc(year), lifeExp) |>
  head(5)
#> # A tibble: 5 × 6
#>   country      continent  year lifeExp      pop gdpPercap
#>   <fct>        <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Swaziland    Africa     2007    39.6  1133066     4513.
#> 2 Mozambique   Africa     2007    42.1 19951656      824.
#> 3 Zambia       Africa     2007    42.4 11746035     1271.
#> 4 Sierra Leone Africa     2007    42.6  6144562      863.
#> 5 Lesotho      Africa     2007    42.6  2012649     1569.

4.3.5 summarise() — Compute Summary Statistics

Code
gapminder |>
  filter(year == 2007) |>
  summarise(
    n_countries    = n(),
    mean_lifeExp   = mean(lifeExp),
    median_lifeExp = median(lifeExp),
    sd_lifeExp     = sd(lifeExp),
    min_lifeExp    = min(lifeExp),
    max_lifeExp    = max(lifeExp)
  )
#> # A tibble: 1 × 6
#>   n_countries mean_lifeExp median_lifeExp sd_lifeExp min_lifeExp max_lifeExp
#>         <int>        <dbl>          <dbl>      <dbl>       <dbl>       <dbl>
#> 1         142         67.0           71.9       12.1        39.6        82.6

4.3.6 group_by() + summarise() — Grouped Summaries

This combination is the workhorse of data analysis — compute statistics within groups:

Code
gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarise(
    n_countries  = n(),
    mean_gdp     = mean(gdpPercap),
    mean_lifeExp = mean(lifeExp),
    total_pop    = sum(pop)
  ) |>
  arrange(desc(mean_lifeExp))
#> # A tibble: 5 × 5
#>   continent n_countries mean_gdp mean_lifeExp  total_pop
#>   <fct>           <int>    <dbl>        <dbl>      <dbl>
#> 1 Oceania             2   29810.         80.7   24549947
#> 2 Europe             30   25054.         77.6  586098529
#> 3 Americas           25   11003.         73.6  898871184
#> 4 Asia               33   12473.         70.7 3811953827
#> 5 Africa             52    3089.         54.8  929539692
Code
# group_by + mutate: compute relative to group (not summarise)
gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  mutate(
    gdp_rel_to_continent = gdpPercap / mean(gdpPercap),
    gdp_rank_in_continent = rank(desc(gdpPercap))
  ) |>
  filter(country %in% c("India", "China", "Japan")) |>
  select(country, continent, gdpPercap, gdp_rel_to_continent, gdp_rank_in_continent)
#> # A tibble: 3 × 5
#> # Groups:   continent [1]
#>   country continent gdpPercap gdp_rel_to_continent gdp_rank_in_continent
#>   <fct>   <fct>         <dbl>                <dbl>                 <dbl>
#> 1 China   Asia          4959.                0.398                    15
#> 2 India   Asia          2452.                0.197                    25
#> 3 Japan   Asia         31656.                2.54                      4

4.3.7 rename() and relocate()

Code
gapminder |>
  rename(
    gdp_per_capita = gdpPercap,
    life_expectancy = lifeExp,
    population = pop
  ) |>
  relocate(population, .before = life_expectancy) |>
  head(3)
#> # A tibble: 3 × 6
#>   country     continent  year population life_expectancy gdp_per_capita
#>   <fct>       <fct>     <int>      <int>           <dbl>          <dbl>
#> 1 Afghanistan Asia       1952    8425333            28.8           779.
#> 2 Afghanistan Asia       1957    9240934            30.3           821.
#> 3 Afghanistan Asia       1962   10267083            32.0           853.

4.4 Joining Data Frames

Real analysis almost always involves combining multiple tables. dplyr provides a family of join functions:

Code
# Example: join crop yields with state population data
crop_yield <- tibble(
  state  = c("Punjab", "Haryana", "UP", "MP"),
  crop   = c("Wheat", "Wheat", "Rice", "Soybean"),
  yield  = c(50.2, 47.8, 32.1, 12.5)
)

state_pop <- tibble(
  state      = c("Punjab", "Haryana", "UP", "Rajasthan"),
  population = c(27.7, 25.4, 199.8, 68.5)  # millions
)

# Inner join: only rows with matches in BOTH tables
inner_join(crop_yield, state_pop, by = "state")
#> # A tibble: 3 × 4
#>   state   crop  yield population
#>   <chr>   <chr> <dbl>      <dbl>
#> 1 Punjab  Wheat  50.2       27.7
#> 2 Haryana Wheat  47.8       25.4
#> 3 UP      Rice   32.1      200.

# Left join: all rows from crop_yield, NA for non-matches
left_join(crop_yield, state_pop, by = "state")
#> # A tibble: 4 × 4
#>   state   crop    yield population
#>   <chr>   <chr>   <dbl>      <dbl>
#> 1 Punjab  Wheat    50.2       27.7
#> 2 Haryana Wheat    47.8       25.4
#> 3 UP      Rice     32.1      200. 
#> 4 MP      Soybean  12.5       NA

# Full join: all rows from both tables
full_join(crop_yield, state_pop, by = "state")
#> # A tibble: 5 × 4
#>   state     crop    yield population
#>   <chr>     <chr>   <dbl>      <dbl>
#> 1 Punjab    Wheat    50.2       27.7
#> 2 Haryana   Wheat    47.8       25.4
#> 3 UP        Rice     32.1      200. 
#> 4 MP        Soybean  12.5       NA  
#> 5 Rajasthan <NA>     NA         68.5

4.5 Reshaping Data with tidyr

4.5.1 pivot_longer() — Wide to Long

Code
# Monthly rainfall data (wide format — common in reports)
rainfall_wide <- tibble(
  station = c("Ludhiana", "Chandigarh", "Amritsar"),
  Jan = c(28, 32, 25),
  Feb = c(35, 38, 30),
  Mar = c(22, 19, 18),
  Apr = c(8, 7, 6)
)

print(rainfall_wide)
#> # A tibble: 3 × 5
#>   station      Jan   Feb   Mar   Apr
#>   <chr>      <dbl> <dbl> <dbl> <dbl>
#> 1 Ludhiana      28    35    22     8
#> 2 Chandigarh    32    38    19     7
#> 3 Amritsar      25    30    18     6

# Convert to long (tidy) format
rainfall_long <- rainfall_wide |>
  pivot_longer(
    cols      = Jan:Apr,
    names_to  = "month",
    values_to = "rainfall_mm"
  )

print(rainfall_long)
#> # A tibble: 12 × 3
#>    station    month rainfall_mm
#>    <chr>      <chr>       <dbl>
#>  1 Ludhiana   Jan            28
#>  2 Ludhiana   Feb            35
#>  3 Ludhiana   Mar            22
#>  4 Ludhiana   Apr             8
#>  5 Chandigarh Jan            32
#>  6 Chandigarh Feb            38
#>  7 Chandigarh Mar            19
#>  8 Chandigarh Apr             7
#>  9 Amritsar   Jan            25
#> 10 Amritsar   Feb            30
#> 11 Amritsar   Mar            18
#> 12 Amritsar   Apr             6

4.5.2 pivot_wider() — Long to Wide

Code
# Convert back to wide (useful for creating summary tables)
rainfall_long |>
  pivot_wider(
    names_from  = month,
    values_from = rainfall_mm
  )
#> # A tibble: 3 × 5
#>   station      Jan   Feb   Mar   Apr
#>   <chr>      <dbl> <dbl> <dbl> <dbl>
#> 1 Ludhiana      28    35    22     8
#> 2 Chandigarh    32    38    19     7
#> 3 Amritsar      25    30    18     6

# A more complex example: multiple variables
panel_long <- tibble(
  state = rep(c("Punjab", "Haryana"), each = 3),
  year  = rep(2021:2023, 2),
  gdp   = c(6.5, 6.8, 7.1, 5.2, 5.4, 5.7),
  pop   = c(27.5, 27.7, 27.9, 25.2, 25.4, 25.6)
)

panel_wide <- panel_long |>
  pivot_wider(
    names_from  = year,
    values_from = c(gdp, pop),
    names_sep   = "_"
  )

print(panel_wide)
#> # A tibble: 2 × 7
#>   state   gdp_2021 gdp_2022 gdp_2023 pop_2021 pop_2022 pop_2023
#>   <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 Punjab       6.5      6.8      7.1     27.5     27.7     27.9
#> 2 Haryana      5.2      5.4      5.7     25.2     25.4     25.6

4.5.3 Handling Missing Values

Code
df_with_na <- tibble(
  x = c(1, NA, 3, NA, 5),
  y = c("a", "b", NA, "d", "e")
)

# Detect missing
is.na(df_with_na$x)
#> [1] FALSE  TRUE FALSE  TRUE FALSE
sum(is.na(df_with_na$x))   # Count NAs
#> [1] 2

# Drop rows with any NA
df_with_na |> drop_na()
#> # A tibble: 2 × 2
#>       x y    
#>   <dbl> <chr>
#> 1     1 a    
#> 2     5 e

# Drop rows with NA in specific columns
df_with_na |> drop_na(x)
#> # A tibble: 3 × 2
#>       x y    
#>   <dbl> <chr>
#> 1     1 a    
#> 2     3 <NA> 
#> 3     5 e

# Fill NAs
df_with_na |>
  mutate(x_filled = replace_na(x, 0))
#> # A tibble: 5 × 3
#>       x y     x_filled
#>   <dbl> <chr>    <dbl>
#> 1     1 a            1
#> 2    NA b            0
#> 3     3 <NA>         3
#> 4    NA d            0
#> 5     5 e            5

# Fill forward/backward
df_with_na |>
  mutate(x_forward = tidyr::fill(pick(x), x)$x)
#> # A tibble: 5 × 3
#>       x y     x_forward
#>   <dbl> <chr>     <dbl>
#> 1     1 a             1
#> 2    NA b             1
#> 3     3 <NA>          3
#> 4    NA d             3
#> 5     5 e             5

4.6 data.table for Large Datasets

For datasets with millions of rows, data.table offers dramatically faster performance than dplyr.

Code
library(data.table)

# Convert to data.table
mtcars_dt <- as.data.table(mtcars)

# Syntax: DT[i, j, by]
# i = row filter, j = column operations, by = grouping

# Filter rows
mtcars_dt[cyl == 6]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
#> 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
#> 3:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 4:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
#> 5:  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
#> 6:  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4
#> 7:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6

# Select columns
mtcars_dt[, .(mpg, cyl, hp)]
#>       mpg   cyl    hp
#>     <num> <num> <num>
#>  1:  21.0     6   110
#>  2:  21.0     6   110
#>  3:  22.8     4    93
#>  4:  21.4     6   110
#>  5:  18.7     8   175
#>  6:  18.1     6   105
#>  7:  14.3     8   245
#>  8:  24.4     4    62
#>  9:  22.8     4    95
#> 10:  19.2     6   123
#> 11:  17.8     6   123
#> 12:  16.4     8   180
#> 13:  17.3     8   180
#> 14:  15.2     8   180
#> 15:  10.4     8   205
#> 16:  10.4     8   215
#> 17:  14.7     8   230
#> 18:  32.4     4    66
#> 19:  30.4     4    52
#> 20:  33.9     4    65
#> 21:  21.5     4    97
#> 22:  15.5     8   150
#> 23:  15.2     8   150
#> 24:  13.3     8   245
#> 25:  19.2     8   175
#> 26:  27.3     4    66
#> 27:  26.0     4    91
#> 28:  30.4     4   113
#> 29:  15.8     8   264
#> 30:  19.7     6   175
#> 31:  15.0     8   335
#> 32:  21.4     4   109
#>       mpg   cyl    hp

# Add/modify column
mtcars_dt[, mpg_per_hp := mpg / hp]

# Group by and summarise
mtcars_dt[, .(mean_mpg = mean(mpg), count = .N), by = cyl]
#>      cyl mean_mpg count
#>    <num>    <num> <int>
#> 1:     6 19.74286     7
#> 2:     4 26.66364    11
#> 3:     8 15.10000    14
TipWhen to Use data.table
  • Datasets with >1 million rows: data.table is 5–50× faster than dplyr
  • Memory-constrained environments: data.table modifies objects in-place
  • Production pipelines requiring maximum speed

For typical research datasets (<100,000 rows), dplyr’s more readable syntax is preferred.

4.7 A Complete Data Wrangling Pipeline

Code
# Simulate a realistic messy dataset
set.seed(42)
raw_data <- tibble(
  district   = rep(c("Ludhiana", "Amritsar", "Jalandhar", "Patiala"), each = 5),
  year       = rep(2019:2023, 4),
  wheat_prod = round(runif(20, 800, 1500)),
  rice_prod  = round(runif(20, 200, 600)),
  area_ha    = round(runif(20, 100, 300)),
  notes      = sample(c("verified", "estimated", NA), 20, replace = TRUE)
)

# Full wrangling pipeline
clean_data <- raw_data |>
  # 1. Filter recent years only
  filter(year >= 2021) |>
  # 2. Handle missing notes
  mutate(notes = replace_na(notes, "unknown")) |>
  # 3. Create derived variables
  mutate(
    total_prod      = wheat_prod + rice_prod,
    productivity    = total_prod / area_ha,
    high_yield      = productivity > median(productivity)
  ) |>
  # 4. Reshape to long format
  pivot_longer(
    cols      = ends_with("_prod"),
    names_to  = "crop",
    values_to = "production"
  ) |>
  mutate(crop = str_remove(crop, "_prod")) |>
  # 5. Summarise by district and crop
  group_by(district, crop) |>
  summarise(
    mean_production = round(mean(production), 1),
    total_years     = n(),
    .groups         = "drop"
  ) |>
  # 6. Sort
  arrange(crop, desc(mean_production))

print(clean_data)
#> # A tibble: 12 × 4
#>    district  crop  mean_production total_years
#>    <chr>     <chr>           <dbl>       <int>
#>  1 Amritsar  rice             492.           3
#>  2 Ludhiana  rice             469.           3
#>  3 Patiala   rice             430.           3
#>  4 Jalandhar rice             344.           3
#>  5 Ludhiana  total           1679.           3
#>  6 Amritsar  total           1641            3
#>  7 Jalandhar total           1529.           3
#>  8 Patiala   total           1499            3
#>  9 Ludhiana  wheat           1210            3
#> 10 Jalandhar wheat           1186.           3
#> 11 Amritsar  wheat           1149.           3
#> 12 Patiala   wheat           1069.           3

4.8 Exercises

  1. Using the gapminder dataset, find the five countries with the highest GDP per capita in 2007. Show only the country name, continent, and GDP per capita.

  2. Compute the average life expectancy per continent for each decade (1950s, 1960s, …, 2000s). Hint: Create a decade variable using integer division.

  3. The following dataset is in wide format. Convert it to tidy (long) format, then compute the mean score per student:

scores <- tibble(
  student = c("Aarav", "Bhavna", "Chirag"),
  maths   = c(85, 92, 78),
  english = c(72, 88, 95),
  science = c(90, 76, 83)
)
  1. Join the crop_yield and state_pop tibbles from Section 4.4 using a right_join(). What is the result? How does it differ from left_join()?

  2. Challenge: Download state-wise crop production data from data.gov.in. Import, clean (handle NAs, standardise column names), reshape to tidy format, and compute the top 5 producing states for each crop in the most recent year available.