6  Descriptive Statistics and Exploratory Data Analysis

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Compute and interpret measures of central tendency and dispersion
  • Use skimr::skim() for comprehensive dataset summaries
  • Build and visualise correlation matrices
  • Conduct a structured Exploratory Data Analysis (EDA)
  • Identify outliers and data quality issues graphically and numerically

6.1 Why Exploratory Data Analysis?

Before fitting any model, you must understand your data. EDA is the practice of summarising and visualising data to uncover patterns, spot anomalies, test hypotheses, and check assumptions. Skipping EDA is one of the most common mistakes in data analysis.

A good EDA answers:

  • What are the distributions of my variables?
  • Are there outliers or implausible values?
  • How are variables related to each other?
  • Is there missing data? Where and why?
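Most of these questions can be answered at a glance with a handful of base R calls before reaching for the specialised tools introduced below. A minimal first-look sketch (the choice of functions here is one possibility, not a fixed recipe):

```r
data(airquality)

str(airquality)                          # structure: column types and dimensions
colSums(is.na(airquality))               # how much is missing, and where?
sapply(airquality, range, na.rm = TRUE)  # any implausible extremes?
```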

6.2 Summary Statistics

6.2.1 Base R: summary()

Code
data(airquality)
summary(airquality)
#>      Ozone           Solar.R           Wind             Temp      
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
#>  NA's   :37       NA's   :7                                       
#>      Month            Day      
#>  Min.   :5.000   Min.   : 1.0  
#>  1st Qu.:6.000   1st Qu.: 8.0  
#>  Median :7.000   Median :16.0  
#>  Mean   :6.993   Mean   :15.8  
#>  3rd Qu.:8.000   3rd Qu.:23.0  
#>  Max.   :9.000   Max.   :31.0  
#> 

summary() gives a quick snapshot, but it is hard to read for wide datasets, reports no standard deviation or skewness, and reduces missingness to raw NA counts.

6.2.2 The skimr Package

skimr::skim() provides a far more informative summary:

Code
library(skimr)

skim(airquality)
Data summary
Name airquality
Number of rows 153
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Ozone 37 0.76 42.13 32.99 1.0 18.00 31.5 63.25 168.0 ▇▃▂▁▁
Solar.R 7 0.95 185.93 90.06 7.0 115.75 205.0 258.75 334.0 ▅▃▅▇▅
Wind 0 1.00 9.96 3.52 1.7 7.40 9.7 11.50 20.7 ▂▇▇▃▁
Temp 0 1.00 77.88 9.47 56.0 72.00 79.0 85.00 97.0 ▂▃▇▇▃
Month 0 1.00 6.99 1.42 5.0 6.00 7.0 8.00 9.0 ▇▇▇▇▇
Day 0 1.00 15.80 8.86 1.0 8.00 16.0 23.00 31.0 ▇▇▇▇▆

skim() automatically separates numeric and categorical variables, reports missing counts, and includes mini histograms for quick distributional assessment.
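Because every airquality column is numeric, only one block appears above. To see the separate categorical block, run skim() on data with a non-numeric column; iris (whose Species column is a factor) is a convenient stand-in:

```r
library(skimr)

# iris mixes four numeric columns with the factor Species, so skim()
# prints two blocks: one for factors, one for numerics
skim(iris)
```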

6.2.3 Measures of Central Tendency

Code
ozone <- airquality$Ozone

# Mean — sensitive to outliers
mean(ozone, na.rm = TRUE)
#> [1] 42.12931

# Median — robust to outliers
median(ozone, na.rm = TRUE)
#> [1] 31.5

# Mode — base R has no statistical mode function (mode() returns the storage type)
mode_val <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
mode_val(ozone)
#> [1] 23

# Trimmed mean (remove top/bottom 10%)
mean(ozone, trim = 0.1, na.rm = TRUE)
#> [1] 37.79787
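The sensitivity flagged in the comments above is easy to demonstrate with a toy vector (the numbers are purely illustrative):

```r
x <- c(10, 12, 11, 13, 12)
mean(x)    # 11.6
median(x)  # 12

# a single extreme value drags the mean upward but leaves the median unchanged
x_out <- c(x, 100)
mean(x_out)    # ~26.3
median(x_out)  # 12
```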

6.2.4 Measures of Dispersion

Code
# Variance and standard deviation
var(ozone, na.rm = TRUE)
#> [1] 1088.201
sd(ozone, na.rm = TRUE)
#> [1] 32.98788

# Range
range(ozone, na.rm = TRUE)
#> [1]   1 168
diff(range(ozone, na.rm = TRUE))  # Width of range
#> [1] 167

# Interquartile Range
IQR(ozone, na.rm = TRUE)
#> [1] 45.25
quantile(ozone, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
#>   25%   50%   75% 
#> 18.00 31.50 63.25

# Coefficient of Variation (relative dispersion)
cv <- sd(ozone, na.rm = TRUE) / mean(ozone, na.rm = TRUE) * 100
cat("CV:", round(cv, 1), "%\n")
#> CV: 78.3 %
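Because the CV is unit-free, it lets you compare relative dispersion across variables measured on different scales. A sketch applying the same calculation column-wise:

```r
data(airquality)

cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100

# Ozone varies far more, relative to its mean, than Temp does
round(sapply(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")], cv), 1)
```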

6.2.5 Skewness and Kurtosis

Code
library(moments)

skewness(ozone, na.rm = TRUE)   # > 0 = right skew
#> [1] 1.225681
kurtosis(ozone, na.rm = TRUE)   # > 3 = heavy tails (excess kurtosis = kurtosis - 3)
#> [1] 4.184071
Code
library(tidyverse)  # drop_na() comes from tidyr, ggplot() from ggplot2

airquality |>
  drop_na(Ozone) |>
  ggplot(aes(x = Ozone)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 20, fill = "#3498db", colour = "white", alpha = 0.8) +
  geom_density(colour = "#e74c3c", linewidth = 1) +
  geom_vline(xintercept = mean(ozone, na.rm = TRUE),
             colour = "#2ecc71", linetype = "dashed", linewidth = 0.9) +
  geom_vline(xintercept = median(ozone, na.rm = TRUE),
             colour = "#f39c12", linetype = "dashed", linewidth = 0.9) +
  annotate("text", x = 155, y = 0.018, label = "Mean", colour = "#2ecc71", size = 3.5) +
  annotate("text", x = 155, y = 0.016, label = "Median", colour = "#f39c12", size = 3.5) +
  labs(title = "Distribution of Ozone Levels",
       x = "Ozone (ppb)", y = "Density") +
  theme_minimal(base_size = 13)
Figure 6.1: Ozone levels show strong positive (right) skew.

6.3 Group-wise Summaries

Code
# Add a month label
aq_labeled <- airquality |>
  mutate(
    Month_name = month.abb[Month],
    Season = case_when(
      Month %in% c(5, 6)    ~ "Late Spring",
      Month %in% c(7, 8)    ~ "Summer",
      Month == 9            ~ "Early Autumn"
    )
  )

# Group-wise summary
aq_labeled |>
  group_by(Month_name) |>
  summarise(
    n          = n(),
    n_missing  = sum(is.na(Ozone)),
    mean_ozone = round(mean(Ozone, na.rm = TRUE), 1),
    sd_ozone   = round(sd(Ozone, na.rm = TRUE), 1),
    median_ozone = median(Ozone, na.rm = TRUE),
    max_ozone  = max(Ozone, na.rm = TRUE)
  )
#> # A tibble: 5 × 7
#>   Month_name     n n_missing mean_ozone sd_ozone median_ozone max_ozone
#>   <chr>      <int>     <int>      <dbl>    <dbl>        <dbl>     <int>
#> 1 Aug           31         5       60       39.7           52       168
#> 2 Jul           31         5       59.1     31.6           60       135
#> 3 Jun           30        21       29.4     18.2           23        71
#> 4 May           31         5       23.6     22.2           18       115
#> 5 Sep           30         1       31.4     24.1           23        96
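The Season label created above can drive the same kind of summary. A sketch (the chunk rebuilds the variable with the same case_when() boundaries so it stands alone):

```r
library(dplyr)

airquality |>
  mutate(Season = case_when(
    Month %in% c(5, 6) ~ "Late Spring",
    Month %in% c(7, 8) ~ "Summer",
    Month == 9         ~ "Early Autumn"
  )) |>
  group_by(Season) |>
  summarise(
    n          = n(),
    mean_ozone = round(mean(Ozone, na.rm = TRUE), 1),
    .groups    = "drop"
  )
```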
Code
aq_labeled |>
  drop_na(Ozone) |>
  ggplot(aes(x = reorder(Month_name, Month), y = Ozone, fill = Month_name)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA, show.legend = FALSE) +
  geom_jitter(width = 0.2, alpha = 0.4, size = 1.5, show.legend = FALSE) +
  scale_fill_brewer(palette = "YlOrRd") +
  labs(title = "Monthly Ozone Levels (New York, 1973)",
       x = NULL, y = "Ozone (ppb)") +
  theme_minimal(base_size = 13)
Figure 6.2: Monthly variation in ozone levels with individual observations overlaid.

6.4 Correlation Analysis

6.4.1 Computing Correlation Matrices

Code
# Select numeric columns and drop NAs
aq_numeric <- airquality |>
  select(Ozone, Solar.R, Wind, Temp) |>
  drop_na()

# Pearson correlation (default)
cor_matrix <- cor(aq_numeric)
round(cor_matrix, 3)
#>          Ozone Solar.R   Wind   Temp
#> Ozone    1.000   0.348 -0.612  0.699
#> Solar.R  0.348   1.000 -0.127  0.294
#> Wind    -0.612  -0.127  1.000 -0.497
#> Temp     0.699   0.294 -0.497  1.000

# Spearman rank correlation (robust to outliers and non-normality)
cor(aq_numeric, method = "spearman") |> round(3)
#>          Ozone Solar.R   Wind   Temp
#> Ozone    1.000   0.348 -0.605  0.773
#> Solar.R  0.348   1.000 -0.062  0.210
#> Wind    -0.605  -0.062  1.000 -0.499
#> Temp     0.773   0.210 -0.499  1.000
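A correlation coefficient alone carries no statement of uncertainty. Base R's cor.test() supplements a single pairwise estimate with a confidence interval and p-value (it drops incomplete pairs automatically):

```r
data(airquality)

# is the Ozone-Temp association distinguishable from zero?
cor.test(airquality$Ozone, airquality$Temp)
```

Note that the estimate can differ slightly from the matrix above, which kept only rows complete across all four variables rather than all complete Ozone-Temp pairs.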

6.4.2 Visualising Correlations

Code
library(corrplot)

corrplot(
  cor_matrix,
  method  = "color",
  type    = "upper",
  addCoef.col = "black",
  tl.col  = "black",
  tl.srt  = 45,
  col     = colorRampPalette(c("#e74c3c", "white", "#3498db"))(200),
  title   = "Correlation Matrix: Air Quality Variables",
  mar     = c(0, 0, 1, 0)
)
Figure 6.3: Correlation matrix for the airquality dataset. Blue = positive, red = negative.
Code
library(GGally)

aq_numeric |>
  ggpairs(
    columnLabels = c("Ozone", "Solar Rad.", "Wind", "Temp"),
    upper = list(continuous = wrap("cor", size = 3.5)),
    lower = list(continuous = wrap("smooth", alpha = 0.3, se = FALSE)),
    diag  = list(continuous = wrap("densityDiag", fill = "#3498db", alpha = 0.4))
  ) +
  theme_minimal(base_size = 10) +
  labs(title = "Pairs Plot: Air Quality Variables")
Figure 6.4: Pairs plot showing pairwise relationships and distributions.

6.5 Identifying Outliers

6.5.1 The IQR Method

An observation is a potential outlier if it falls more than 1.5 × IQR below Q1 or above Q3:

Code
identify_outliers_iqr <- function(x, na.rm = TRUE) {
  q <- quantile(x, c(0.25, 0.75), na.rm = na.rm)
  iqr <- IQR(x, na.rm = na.rm)
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  list(
    lower_bound = lower,
    upper_bound = upper,
    n_outliers  = sum(x < lower | x > upper, na.rm = na.rm),
    outlier_values = x[!is.na(x) & (x < lower | x > upper)]
  )
}

identify_outliers_iqr(airquality$Ozone)
#> $lower_bound
#>     25% 
#> -49.875 
#> 
#> $upper_bound
#>     75% 
#> 131.125 
#> 
#> $n_outliers
#> [1] 2
#> 
#> $outlier_values
#> [1] 135 168

6.5.2 Z-Score Method

Code
z_scores <- scale(aq_numeric$Ozone)
outlier_idx <- which(abs(z_scores) > 3)
cat("Outliers (|Z| > 3):", aq_numeric$Ozone[outlier_idx], "\n")
#> Outliers (|Z| > 3): 168
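One caveat: the mean and standard deviation in the Z-score are themselves inflated by the very outliers being hunted. A common robust variant, shown here as an addition to the two methods above, replaces them with the median and MAD:

```r
data(airquality)
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# robust z-score: centre on the median, scale by the median absolute deviation
rz <- (ozone - median(ozone)) / mad(ozone)
ozone[abs(rz) > 3]
```

Because the median and MAD barely move when extreme values are added, this version often flags more of a heavy right tail than the classical Z-score does.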

6.6 A Structured EDA Template

Here is a reusable EDA template, applied below to airquality:

Code
eda_report <- function(df, target_var = NULL) {

  cat("===== DATASET OVERVIEW =====\n")
  cat("Dimensions:", nrow(df), "rows ×", ncol(df), "columns\n")
  cat("Missing values per column:\n")
  print(colSums(is.na(df)))

  cat("\n===== NUMERIC VARIABLES =====\n")
  nums <- df |> select(where(is.numeric))
  if (ncol(nums) > 0) {
    print(skimr::skim(nums))
  }

  cat("\n===== CATEGORICAL VARIABLES =====\n")
  cats <- df |> select(where(is.character) | where(is.factor))
  if (ncol(cats) > 0) {
    for (col in names(cats)) {
      cat("\n", col, ":\n")
      print(table(cats[[col]], useNA = "always"))
    }
  }
}

eda_report(airquality)
#> ===== DATASET OVERVIEW =====
#> Dimensions: 153 rows × 6 columns
#> Missing values per column:
#>   Ozone Solar.R    Wind    Temp   Month     Day 
#>      37       7       0       0       0       0 
#> 
#> ===== NUMERIC VARIABLES =====
#> ── Data Summary ────────────────────────
#>                            Values
#> Name                       nums  
#> Number of rows             153   
#> Number of columns          6     
#> _______________________          
#> Column type frequency:           
#>   numeric                  6     
#> ________________________         
#> Group variables            None  
#> 
#> ── Variable type: numeric ──────────────────────────────────────────────────────
#>   skim_variable n_missing complete_rate   mean    sd   p0   p25   p50   p75
#> 1 Ozone                37         0.758  42.1  33.0   1    18    31.5  63.2
#> 2 Solar.R               7         0.954 186.   90.1   7   116.  205   259. 
#> 3 Wind                  0         1       9.96  3.52  1.7   7.4   9.7  11.5
#> 4 Temp                  0         1      77.9   9.47 56    72    79    85  
#> 5 Month                 0         1       6.99  1.42  5     6     7     8  
#> 6 Day                   0         1      15.8   8.86  1     8    16    23  
#>    p100 hist 
#> 1 168   ▇▃▂▁▁
#> 2 334   ▅▃▅▇▅
#> 3  20.7 ▂▇▇▃▁
#> 4  97   ▂▃▇▇▃
#> 5   9   ▇▇▇▇▇
#> 6  31   ▇▇▇▇▆
#> 
#> ===== CATEGORICAL VARIABLES =====

6.7 Exercises

  1. Using the diamonds dataset (from ggplot2), compute the mean, median, standard deviation, and IQR of the price variable. Is the distribution skewed? How do you know?

  2. Compute group-wise summaries of diamond price by cut. Which cut has the highest mean price? Is that consistent with the highest quality?

  3. Create a correlation matrix for the numeric columns of the diamonds dataset. Which pair of variables has the strongest positive correlation?

  4. Identify outliers in diamonds$carat using the IQR method. How many outliers are there? What are their values?

  5. Challenge: Write a complete EDA report (in Quarto) for India’s state-wise crop production data. Include summary statistics, distributions, group comparisons, and a correlation matrix. Use callout boxes to highlight your key findings.