10  Mixed Effects Models

NoteLearning Objectives

By the end of this chapter, you will be able to:

  • Explain the difference between fixed and random effects
  • Identify situations where mixed models are appropriate
  • Fit linear and generalised linear mixed models using lme4
  • Interpret random intercept and random slope models
  • Extract and visualise random effects

10.1 Fixed vs. Random Effects

Fixed effects are the factors whose specific levels we care about — variety A vs. variety B, urban vs. rural, male vs. female.

Random effects are grouping factors where the specific levels are a sample from a larger population of possible levels — specific schools, villages, districts, patients, years. We model their variance rather than estimating a coefficient for each level.

Why does this matter? Standard regression assumes all observations are independent. When observations are nested within groups (students within schools, repeat measures within individuals), this assumption is violated. Mixed models account for this clustering.

10.2 When to Use Mixed Models

Mixed models are appropriate for:

  • Repeated measures: Same subject measured multiple times
  • Longitudinal data: Observations over time within units
  • Nested designs: Students in classrooms, fields in villages, patients in hospitals
  • Panel data: Cross-sectional units observed across multiple time periods

10.3 Fitting Mixed Models with lme4

10.3.1 Random Intercept Model

Code
# Simulate: crop yield measured in multiple plots within 20 villages, over 5 years
set.seed(123)
n_villages <- 20
n_years    <- 5
n_plots    <- 8

panel_data <- expand_grid(
  village = paste("V", str_pad(1:n_villages, 2, pad = "0"), sep = ""),
  year    = 2019:2023,
  plot    = 1:n_plots
) |>
  mutate(
    village_effect = rep(rnorm(n_villages, 0, 300), each = n_years * n_plots),
    rainfall_mm    = round(rnorm(n(), 800, 120)),
    fertiliser_kg  = round(runif(n(), 50, 150)),
    yield          = 3500 +
      0.8 * rainfall_mm +
      3.2 * fertiliser_kg +
      village_effect +
      rnorm(n(), 0, 250)
  )

head(panel_data)
#> # A tibble: 6 × 7
#>   village  year  plot village_effect rainfall_mm fertiliser_kg yield
#>   <chr>   <int> <int>          <dbl>       <dbl>         <dbl> <dbl>
#> 1 V01      2019     1          -168.         672            61 4170.
#> 2 V01      2019     2          -168.         774           133 4128.
#> 3 V01      2019     3          -168.         677           127 3739.
#> 4 V01      2019     4          -168.         713           144 4204.
#> 5 V01      2019     5          -168.         725            97 4125.
#> 6 V01      2019     6          -168.         598           144 4485.
Code
# Random intercept: each village gets its own intercept
m_ri <- lmer(yield ~ rainfall_mm + fertiliser_kg + (1 | village),
             data = panel_data, REML = TRUE)

summary(m_ri)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: yield ~ rainfall_mm + fertiliser_kg + (1 | village)
#>    Data: panel_data
#> 
#> REML criterion at convergence: 11195.4
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.9310 -0.6573  0.0170  0.6912  3.4135 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  village  (Intercept) 83750    289.4   
#>  Residual             64133    253.2   
#> Number of obs: 800, groups:  village, 20
#> 
#> Fixed effects:
#>                Estimate Std. Error t value
#> (Intercept)   3.673e+03  9.496e+01  38.680
#> rainfall_mm   6.898e-01  7.686e-02   8.975
#> fertiliser_kg 2.893e+00  3.080e-01   9.395
#> 
#> Correlation of Fixed Effects:
#>             (Intr) rnfll_
#> rainfall_mm -0.649       
#> fertilsr_kg -0.326  0.002
Code
# Fixed effects
fixef(m_ri)
#>   (Intercept)   rainfall_mm fertiliser_kg 
#>  3673.1289051     0.6898356     2.8933117

# Random effects (BLUPs — Best Linear Unbiased Predictors)
ranef(m_ri)$village |> head(8)
#>     (Intercept)
#> V01  -209.70195
#> V02   -95.20523
#> V03   377.98844
#> V04   -51.76045
#> V05    12.94438
#> V06   474.33623
#> V07   103.15859
#> V08  -381.40693

# Intraclass Correlation Coefficient
performance::icc(m_ri)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.566
#>   Unadjusted ICC: 0.518

The ICC tells us what proportion of total variance is due to village-level clustering. A high ICC (> 0.1) justifies a mixed model.

10.3.2 Random Slope Model

Code
# Random intercept AND slope: fertiliser effect varies by village
m_rs <- lmer(yield ~ rainfall_mm + fertiliser_kg + (1 + fertiliser_kg | village),
             data = panel_data, REML = TRUE)

# Compare models
anova(m_ri, m_rs)
#> Data: panel_data
#> Models:
#> m_ri: yield ~ rainfall_mm + fertiliser_kg + (1 | village)
#> m_rs: yield ~ rainfall_mm + fertiliser_kg + (1 + fertiliser_kg | village)
#>      npar   AIC   BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)
#> m_ri    5 11212 11235 -5600.9     11202                     
#> m_rs    7 11216 11248 -5600.8     11202 0.0534  2     0.9737
Code
ranef_df <- ranef(m_ri)$village |>
  rownames_to_column("village") |>
  rename(intercept = `(Intercept)`) |>
  arrange(intercept)

ggplot(ranef_df, aes(x = intercept, y = reorder(village, intercept))) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_errorbarh(aes(xmin = intercept - 1.96 * 50, xmax = intercept + 1.96 * 50),
                 height = 0.3, alpha = 0.5) +
  geom_point(colour = "#3498db", size = 2.5) +
  labs(title = "Village-Level Random Intercepts",
       x = "Deviation from Grand Mean (kg/ha)", y = "Village") +
  theme_minimal(base_size = 11)
Figure 10.1: Random intercepts for each village — some consistently produce higher yields.

10.3.3 Generalised Linear Mixed Models (GLMM)

For binary or count outcomes with clustering:

Code
# Binary outcome: whether a plot achieved above-median yield
panel_data <- panel_data |>
  mutate(high_yield = as.integer(yield > median(yield)))

m_glmm <- glmer(
  high_yield ~ rainfall_mm + fertiliser_kg + (1 | village),
  data   = panel_data,
  family = binomial(link = "logit")
)

broom.mixed::tidy(m_glmm, exponentiate = TRUE, conf.int = TRUE) |>
  filter(effect == "fixed")
#> # A tibble: 3 × 9
#>   effect group term     estimate std.error statistic  p.value conf.low conf.high
#>   <chr>  <chr> <chr>       <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 fixed  <NA>  (Interc…  0.00484  0.00416      -6.21 5.20e-10 0.000902    0.0260
#> 2 fixed  <NA>  rainfal…  1.00     0.000819      5.09 3.63e- 7 1.00        1.01  
#> 3 fixed  <NA>  fertili…  1.02     0.00331       5.96 2.48e- 9 1.01        1.03

10.4 Exercises

  1. Simulate a panel dataset with 15 schools, 30 students per school, and a student-level test score outcome. Fit a random intercept model with student characteristics as fixed effects. Report the ICC.

  2. Using the sleepstudy dataset from lme4 (reaction time over days of sleep deprivation), fit a random slope model where the effect of days can vary by subject. Is there significant between-subject variation in the sleep deprivation effect?

  3. Compare a standard linear regression (ignoring clustering) with a mixed model on the same data. How do the standard errors differ? What are the implications for inference?