9  Generalized Linear Models

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Explain why linear regression fails for binary and count outcomes
  • Fit and interpret logistic regression models
  • Compute and interpret odds ratios
  • Fit Poisson regression for count data
  • Assess model fit with deviance and AIC

9.1 Why GLMs?

Linear regression assumes the outcome is continuous and normally distributed. Many real outcomes violate this:

  • Binary outcomes (yes/no, success/failure) → Logistic Regression
  • Count outcomes (number of events, hospital visits) → Poisson Regression
  • Rate outcomes (events per person-year) → Poisson with offset

Generalized Linear Models (GLMs) unify these cases through:

  1. A distribution (Normal, Binomial, Poisson, …)
  2. A link function connecting the linear predictor to the outcome mean
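In R, both components are bundled into the family object passed to `glm()`: the family names the distribution, and `link = ...` sets the link function. A quick illustration using base R family objects:

```r
# The family object carries the link function and its inverse
fam <- binomial(link = "logit")

fam$linkfun(0.5)   # link: probability -> log-odds; log(0.5 / 0.5) = 0
#> [1] 0
fam$linkinv(0)     # inverse link: log-odds -> probability
#> [1] 0.5

# Poisson regression uses a log link by default
poisson(link = "log")$linkfun(10)  # log(10)
#> [1] 2.302585
```

This is why `glm()` calls below specify both pieces, e.g. `family = binomial(link = "logit")`.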

9.2 Logistic Regression

For a binary outcome \(Y \in \{0, 1\}\), we model the log-odds:

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]

Code
# Simulate: probability of adopting improved seeds depends on education and income
set.seed(42)
n <- 300

farm_data <- tibble(
  education_yrs = round(rnorm(n, mean = 8, sd = 3), 0) |> pmax(0),
  income_lakh   = round(runif(n, 1, 15), 1),
  access_market = rbinom(n, 1, 0.6)
) |>
  mutate(
    log_odds = -3.5 + 0.25 * education_yrs + 0.12 * income_lakh + 0.8 * access_market,
    prob     = plogis(log_odds),
    adopted  = rbinom(n, 1, prob)
  )

table(farm_data$adopted)
#> 
#>   0   1 
#> 170 130
mean(farm_data$adopted)
#> [1] 0.4333333
Code
m_logit <- glm(
  adopted ~ education_yrs + income_lakh + access_market,
  data   = farm_data,
  family = binomial(link = "logit")
)

summary(m_logit)
#> 
#> Call:
#> glm(formula = adopted ~ education_yrs + income_lakh + access_market, 
#>     family = binomial(link = "logit"), data = farm_data)
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   -4.11720    0.57978  -7.101 1.24e-12 ***
#> education_yrs  0.26643    0.04931   5.403 6.56e-08 ***
#> income_lakh    0.13649    0.03375   4.044 5.25e-05 ***
#> access_market  0.91009    0.27721   3.283  0.00103 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 410.54  on 299  degrees of freedom
#> Residual deviance: 347.33  on 296  degrees of freedom
#> AIC: 355.33
#> 
#> Number of Fisher Scoring iterations: 4

9.2.1 Interpreting Logistic Regression

Raw coefficients are in log-odds, which are hard to interpret. We exponentiate to get odds ratios:

Code
# Tidy with confidence intervals
tidy_logit <- tidy(m_logit, exponentiate = TRUE, conf.int = TRUE)
print(tidy_logit)
#> # A tibble: 4 × 7
#>   term          estimate std.error statistic  p.value conf.low conf.high
#>   <chr>            <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept)     0.0163    0.580      -7.10 1.24e-12  0.00495    0.0484
#> 2 education_yrs   1.31      0.0493      5.40 6.56e- 8  1.19       1.44  
#> 3 income_lakh     1.15      0.0338      4.04 5.25e- 5  1.07       1.23  
#> 4 access_market   2.48      0.277       3.28 1.03e- 3  1.45       4.32
Code
tidy_logit |>
  filter(term != "(Intercept)") |>
  mutate(term = recode(term,
                       "education_yrs" = "Years of Education",
                       "income_lakh"   = "Income (Lakh ₹)",
                       "access_market" = "Market Access")) |>
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_point(size = 3.5, colour = "#e74c3c") +
  scale_x_log10() +
  labs(title = "Odds Ratios: Adoption of Improved Seeds",
       x = "Odds Ratio (log scale)", y = NULL) +
  theme_minimal(base_size = 13)
Figure 9.1: Odds ratios with 95% confidence intervals for adoption of improved seeds.
Tip: Interpreting Odds Ratios
  • OR = 1: No association
  • OR > 1: Predictor increases the odds of the outcome
  • OR < 1: Predictor decreases the odds of the outcome

“Farmers with market access have 2.48 times the odds of adopting improved seeds, compared to those without market access, holding other variables constant.”
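As a quick check, the odds ratio in the `tidy()` output is just the exponentiated coefficient from the `summary()` output:

```r
# Coefficient for access_market from summary(m_logit): 0.91009 (log-odds scale)
exp(0.91009)   # ~ 2.48, matching the odds ratio reported by tidy()
```

The same conversion works in reverse: `log()` of an odds ratio recovers the log-odds coefficient.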

9.2.2 Predicted Probabilities

Code
# Create prediction grid
pred_grid <- expand_grid(
  income_lakh   = seq(1, 15, by = 0.5),
  access_market = c(0, 1),
  education_yrs = mean(farm_data$education_yrs)
) |>
  mutate(
    pred_prob = predict(m_logit, newdata = pick(everything()), type = "response"),
    market    = factor(access_market, labels = c("No Access", "Market Access"))
  )

ggplot(pred_grid, aes(x = income_lakh, y = pred_prob, colour = market)) +
  geom_line(linewidth = 1.3) +
  scale_colour_brewer(palette = "Set1") +
  labs(title   = "Predicted Probability of Seed Adoption",
       subtitle = paste0("Education held at mean (", round(mean(farm_data$education_yrs), 1), " years)"),
       x = "Income (Lakh ₹)", y = "Predicted Probability", colour = NULL) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal(base_size = 13)
Figure 9.2: Predicted probability of seed adoption by income level and market access.

9.2.3 Model Fit

Code
# Null and residual deviance
glance(m_logit)
#> # A tibble: 1 × 8
#>   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
#>           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
#> 1          411.     299  -174.  355.  370.     347.         296   300

# McFadden's pseudo R-squared
mcfadden_r2 <- 1 - (m_logit$deviance / m_logit$null.deviance)
cat("McFadden R²:", round(mcfadden_r2, 3), "\n")
#> McFadden R²: 0.154

# Classification accuracy
threshold <- 0.5
predicted_class <- as.integer(fitted(m_logit) >= threshold)
conf_matrix <- table(Predicted = predicted_class, Actual = farm_data$adopted)
print(conf_matrix)
#>          Actual
#> Predicted   0   1
#>         0 135  51
#>         1  35  79
accuracy <- mean(predicted_class == farm_data$adopted)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")
#> Accuracy: 71.3 %
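Overall accuracy can mislead when classes are imbalanced; sensitivity and specificity split performance by actual class. Computed from the `conf_matrix` object above:

```r
# Sensitivity: proportion of actual adopters correctly predicted (79 / 130)
sensitivity <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])
# Specificity: proportion of actual non-adopters correctly predicted (135 / 170)
specificity <- conf_matrix["0", "0"] / sum(conf_matrix[, "0"])

cat("Sensitivity:", round(sensitivity, 3), "\n")
#> Sensitivity: 0.608
cat("Specificity:", round(specificity, 3), "\n")
#> Specificity: 0.794
```

The 0.5 threshold is a convention, not a requirement; lowering it trades specificity for sensitivity.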

9.3 Poisson Regression

For count outcomes (non-negative integers), Poisson regression models the log of the expected count:

\[\log(\mu_i) = \beta_0 + \beta_1 X_{1i} + \cdots\]
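Because the model is on the log scale, each coefficient acts multiplicatively on the expected count: a one-unit increase in a predictor multiplies \(\mu\) by \(e^{\beta}\). A small numeric check with an illustrative coefficient:

```r
beta1 <- 0.25                  # illustrative slope on the log scale
mu_x3 <- exp(1.0 + beta1 * 3)  # expected count at X = 3
mu_x4 <- exp(1.0 + beta1 * 4)  # expected count at X = 4

mu_x4 / mu_x3                  # ratio of expected counts
#> [1] 1.284025
exp(beta1)                     # the incidence rate ratio: identical
#> [1] 1.284025
```

This ratio, exp(β), is the incidence rate ratio (IRR) reported below.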

Code
# Hospital admissions per month in different districts
set.seed(7)
hospital_data <- tibble(
  district      = rep(paste("District", 1:30), each = 12),
  month         = rep(1:12, 30),
  n_admissions  = rpois(360, lambda = exp(2.8 + 0.04 * rep(1:12, 30))),
  temperature   = round(15 + 10 * sin(2 * pi * (1:360) / 12) + rnorm(360, 0, 2), 1),
  urban         = rbinom(360, 1, 0.4)
)

head(hospital_data)
#> # A tibble: 6 × 5
#>   district   month n_admissions temperature urban
#>   <chr>      <int>        <int>       <dbl> <int>
#> 1 District 1     1           26        22.2     0
#> 2 District 1     2           12        24.7     1
#> 3 District 1     3           22        24.3     1
#> 4 District 1     4           27        23.4     1
#> 5 District 1     5           19        20.7     1
#> 6 District 1     6           17        16.6     1
Code
m_poisson <- glm(
  n_admissions ~ temperature + urban + factor(month),
  data   = hospital_data,
  family = poisson(link = "log")
)

# Exponentiated coefficients are Incidence Rate Ratios (IRR)
tidy(m_poisson, exponentiate = TRUE, conf.int = TRUE) |>
  filter(!str_detect(term, "month")) |>
  select(term, estimate, conf.low, conf.high, p.value)
#> # A tibble: 3 × 5
#>   term        estimate conf.low conf.high   p.value
#>   <chr>          <dbl>    <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    17.1    13.4       21.8  9.96e-116
#> 2 temperature     1.00    0.989      1.01 8.73e-  1
#> 3 urban           1.01    0.965      1.06 6.63e-  1

9.3.1 Checking for Overdispersion

A key assumption of Poisson regression is that the mean equals the variance. When the variance exceeds the mean (overdispersion), use negative binomial regression instead.

Code
# Dispersion test
dispersion_ratio <- sum(residuals(m_poisson, type = "pearson")^2) / m_poisson$df.residual
cat("Dispersion ratio:", round(dispersion_ratio, 3),
    "\n(Values >> 1 indicate overdispersion)\n")
#> Dispersion ratio: 0.97 
#> (Values >> 1 indicate overdispersion)

# If overdispersed, use negative binomial
# Namespaced call avoids attaching MASS, which masks dplyr::select()
m_negbin <- MASS::glm.nb(n_admissions ~ temperature + urban, data = hospital_data)
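Section 9.1 also listed rates (events per person-year) as Poisson with an offset. A minimal simulated sketch; `rate_data`, `person_years`, and `risk_score` are illustrative names, not part of the chapter's datasets:

```r
library(tibble)

set.seed(99)
rate_data <- tibble(
  person_years = runif(200, 100, 1000),   # exposure time per unit
  risk_score   = rnorm(200),              # a covariate
  events       = rpois(200, lambda = person_years * exp(-4 + 0.3 * risk_score))
)

# offset(log(person_years)) enters with its coefficient fixed at 1,
# so the remaining coefficients model the event *rate* per person-year
m_rate <- glm(events ~ risk_score + offset(log(person_years)),
              data = rate_data, family = poisson(link = "log"))

exp(coef(m_rate))   # exponentiated coefficients are rate ratios
```

The estimated coefficient on `risk_score` should recover roughly the 0.3 used in the simulation.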

9.4 Regression Table: All Models

Code
modelsummary(
  list("Logistic (Adoption)" = m_logit, "Poisson (Admissions)" = m_poisson),
  exponentiate = TRUE,
  stars        = TRUE,
  gof_map      = c("nobs", "AIC", "BIC"),
  coef_omit    = "month"
)
Table 9.1: Summary of GLM specifications. Logistic uses odds ratios; Poisson uses IRR.
                 Logistic (Adoption)   Poisson (Admissions)
(Intercept)      0.016***              17.088***
                 (0.009)               (2.121)
education_yrs    1.305***
                 (0.064)
income_lakh      1.146***
                 (0.039)
access_market    2.485**
                 (0.689)
temperature                            1.001
                                       (0.006)
urban                                  1.010
                                       (0.024)
Num.Obs.         300                   360
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

9.5 Exercises

  1. Using the Titanic dataset (or titanic package), fit a logistic regression predicting survival as a function of passenger class, sex, and age. Interpret the odds ratios. Who had the best survival odds?

  2. Simulate a count outcome (e.g., number of pest-infested fields) as a function of rainfall and temperature. Fit a Poisson regression and interpret the incidence rate ratios.

  3. Test your Poisson model from Exercise 2 for overdispersion. If overdispersion is present, fit a negative binomial model and compare AIC values.

  4. For a logistic regression model, explain in plain language the difference between: (a) a coefficient in log-odds, (b) an odds ratio, and (c) a predicted probability.

  5. Challenge: Using India’s National Family Health Survey (NFHS) data, model the binary outcome of child vaccination (full immunisation yes/no) as a function of mother’s education, wealth index, and place of residence (urban/rural). Create a coefficient plot and write a policy-relevant interpretation.