9  Generalized Linear Models

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Explain why linear regression fails for binary and count outcomes
  • Fit and interpret logistic regression models
  • Compute and interpret odds ratios
  • Fit Poisson regression for count data
  • Assess model fit with deviance and AIC

9.1 Why GLMs?

Linear regression assumes the outcome is continuous and normally distributed. Many real outcomes violate this:

  • Binary outcomes (yes/no, success/failure) → Logistic Regression
  • Count outcomes (number of events, hospital visits) → Poisson Regression
  • Rate outcomes (events per person-year) → Poisson with offset

Generalized Linear Models (GLMs) unify these cases through:

  1. A distribution (Normal, Binomial, Poisson, …)
  2. A link function connecting the linear predictor to the outcome mean
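In R, both components are bundled into the family object passed to `glm()`: the family names the distribution, and `link = ...` sets the link function. A quick illustration using base R family objects:

```r
# The family object carries the link function and its inverse
fam <- binomial(link = "logit")

fam$linkfun(0.5)   # link: probability -> log-odds; log(0.5 / 0.5) = 0
#> [1] 0
fam$linkinv(0)     # inverse link: log-odds -> probability
#> [1] 0.5

# Poisson regression uses a log link by default
poisson(link = "log")$linkfun(10)  # log(10)
#> [1] 2.302585
```

This is why `glm()` calls below specify both pieces, e.g. `family = binomial(link = "logit")`.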

9.2 Logistic Regression

For a binary outcome \(Y \in \{0, 1\}\), we model the log-odds:

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]

Code
# Simulate: probability of adopting improved seeds depends on education and income
set.seed(42)
n <- 300

farm_data <- tibble(
  education_yrs = round(rnorm(n, mean = 8, sd = 3), 0) |> pmax(0),
  income_lakh   = round(runif(n, 1, 15), 1),
  access_market = rbinom(n, 1, 0.6)
) |>
  mutate(
    log_odds = -3.5 + 0.25 * education_yrs + 0.12 * income_lakh + 0.8 * access_market,
    prob     = plogis(log_odds),
    adopted  = rbinom(n, 1, prob)
  )

table(farm_data$adopted)
#> 
#>   0   1 
#> 170 130
mean(farm_data$adopted)
#> [1] 0.4333333
Code
m_logit <- glm(
  adopted ~ education_yrs + income_lakh + access_market,
  data   = farm_data,
  family = binomial(link = "logit")
)

summary(m_logit)
#> 
#> Call:
#> glm(formula = adopted ~ education_yrs + income_lakh + access_market, 
#>     family = binomial(link = "logit"), data = farm_data)
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   -4.11720    0.57978  -7.101 1.24e-12 ***
#> education_yrs  0.26643    0.04931   5.403 6.56e-08 ***
#> income_lakh    0.13649    0.03375   4.044 5.25e-05 ***
#> access_market  0.91009    0.27721   3.283  0.00103 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 410.54  on 299  degrees of freedom
#> Residual deviance: 347.33  on 296  degrees of freedom
#> AIC: 355.33
#> 
#> Number of Fisher Scoring iterations: 4

9.2.1 Interpreting Logistic Regression

Raw coefficients are in log-odds, which are hard to interpret. We exponentiate to get odds ratios:

Code
# Tidy with confidence intervals
tidy_logit <- tidy(m_logit, exponentiate = TRUE, conf.int = TRUE)
print(tidy_logit)
#> # A tibble: 4 × 7
#>   term          estimate std.error statistic  p.value conf.low conf.high
#>   <chr>            <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept)     0.0163    0.580      -7.10 1.24e-12  0.00495    0.0484
#> 2 education_yrs   1.31      0.0493      5.40 6.56e- 8  1.19       1.44  
#> 3 income_lakh     1.15      0.0338      4.04 5.25e- 5  1.07       1.23  
#> 4 access_market   2.48      0.277       3.28 1.03e- 3  1.45       4.32
Code
tidy_logit |>
  filter(term != "(Intercept)") |>
  mutate(term = recode(term,
                       "education_yrs" = "Years of Education",
                       "income_lakh"   = "Income (Lakh ₹)",
                       "access_market" = "Market Access")) |>
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_point(size = 3.5, colour = "#e74c3c") +
  scale_x_log10() +
  labs(title = "Odds Ratios: Adoption of Improved Seeds",
       x = "Odds Ratio (log scale)", y = NULL) +
  theme_minimal(base_size = 13)
Figure 9.1: Odds ratios with 95% confidence intervals for adoption of improved seeds.
Tip: Interpreting Odds Ratios
  • OR = 1: No association
  • OR > 1: Predictor increases the odds of the outcome
  • OR < 1: Predictor decreases the odds of the outcome

“Farmers with market access have 2.48 times the odds of adopting improved seeds, compared to those without market access, holding other variables constant.”
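As a quick check, the odds ratio in the `tidy()` output is just the exponentiated coefficient from the `summary()` output:

```r
# Coefficient for access_market from summary(m_logit): 0.91009 (log-odds scale)
exp(0.91009)   # ~ 2.48, matching the odds ratio reported by tidy()
```

The same conversion works in reverse: `log()` of an odds ratio recovers the log-odds coefficient.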

9.2.2 Predicted Probabilities

Code
# Create prediction grid
pred_grid <- expand_grid(
  income_lakh   = seq(1, 15, by = 0.5),
  access_market = c(0, 1),
  education_yrs = mean(farm_data$education_yrs)
) |>
  mutate(
    pred_prob = predict(m_logit, newdata = pick(everything()), type = "response"),
    market    = factor(access_market, labels = c("No Access", "Market Access"))
  )

ggplot(pred_grid, aes(x = income_lakh, y = pred_prob, colour = market)) +
  geom_line(linewidth = 1.3) +
  scale_colour_brewer(palette = "Set1") +
  labs(title   = "Predicted Probability of Seed Adoption",
       subtitle = paste0("Education held at mean (", round(mean(farm_data$education_yrs), 1), " years)"),
       x = "Income (Lakh ₹)", y = "Predicted Probability", colour = NULL) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal(base_size = 13)
Figure 9.2: Predicted probability of seed adoption by income level and market access.

9.2.3 Model Fit

Code
# Null and residual deviance
glance(m_logit)
#> # A tibble: 1 × 8
#>   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
#>           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
#> 1          411.     299  -174.  355.  370.     347.         296   300

# McFadden's pseudo R-squared
mcfadden_r2 <- 1 - (m_logit$deviance / m_logit$null.deviance)
cat("McFadden R²:", round(mcfadden_r2, 3), "\n")
#> McFadden R²: 0.154

# Classification accuracy
threshold <- 0.5
predicted_class <- as.integer(fitted(m_logit) >= threshold)
conf_matrix <- table(Predicted = predicted_class, Actual = farm_data$adopted)
print(conf_matrix)
#>          Actual
#> Predicted   0   1
#>         0 135  51
#>         1  35  79
accuracy <- mean(predicted_class == farm_data$adopted)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")
#> Accuracy: 71.3 %
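Overall accuracy can mislead when classes are imbalanced; sensitivity and specificity split performance by actual class. Computed from the `conf_matrix` object above:

```r
# Sensitivity: proportion of actual adopters correctly predicted (79 / 130)
sensitivity <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])
# Specificity: proportion of actual non-adopters correctly predicted (135 / 170)
specificity <- conf_matrix["0", "0"] / sum(conf_matrix[, "0"])

cat("Sensitivity:", round(sensitivity, 3), "\n")
#> Sensitivity: 0.608
cat("Specificity:", round(specificity, 3), "\n")
#> Specificity: 0.794
```

The 0.5 threshold is a convention, not a requirement; lowering it trades specificity for sensitivity.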

9.3 Poisson Regression

For count outcomes (non-negative integers), Poisson regression models the log of the expected count:

\[\log(\mu_i) = \beta_0 + \beta_1 X_{1i} + \cdots\]
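Because the model is on the log scale, each coefficient acts multiplicatively on the expected count: a one-unit increase in a predictor multiplies \(\mu\) by \(e^{\beta}\). A small numeric check with an illustrative coefficient:

```r
beta1 <- 0.25                  # illustrative slope on the log scale
mu_x3 <- exp(1.0 + beta1 * 3)  # expected count at X = 3
mu_x4 <- exp(1.0 + beta1 * 4)  # expected count at X = 4

mu_x4 / mu_x3                  # ratio of expected counts
#> [1] 1.284025
exp(beta1)                     # the incidence rate ratio: identical
#> [1] 1.284025
```

This ratio, exp(β), is the incidence rate ratio (IRR) reported below.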

Code
# Hospital admissions per month in different districts
set.seed(7)
hospital_data <- tibble(
  district      = rep(paste("District", 1:30), each = 12),
  month         = rep(1:12, 30),
  n_admissions  = rpois(360, lambda = exp(2.8 + 0.04 * rep(1:12, 30))),
  temperature   = round(15 + 10 * sin(2 * pi * (1:360) / 12) + rnorm(360, 0, 2), 1),
  urban         = rbinom(360, 1, 0.4)
)

head(hospital_data)
#> # A tibble: 6 × 5
#>   district   month n_admissions temperature urban
#>   <chr>      <int>        <int>       <dbl> <int>
#> 1 District 1     1           26        22.2     0
#> 2 District 1     2           12        24.7     1
#> 3 District 1     3           22        24.3     1
#> 4 District 1     4           27        23.4     1
#> 5 District 1     5           19        20.7     1
#> 6 District 1     6           17        16.6     1
Code
m_poisson <- glm(
  n_admissions ~ temperature + urban + factor(month),
  data   = hospital_data,
  family = poisson(link = "log")
)

# Exponentiated coefficients are Incidence Rate Ratios (IRR)
tidy(m_poisson, exponentiate = TRUE, conf.int = TRUE) |>
  filter(!str_detect(term, "month")) |>
  select(term, estimate, conf.low, conf.high, p.value)
#> # A tibble: 3 × 5
#>   term        estimate conf.low conf.high   p.value
#>   <chr>          <dbl>    <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    17.1    13.4       21.8  9.96e-116
#> 2 temperature     1.00    0.989      1.01 8.73e-  1
#> 3 urban           1.01    0.965      1.06 6.63e-  1

9.3.1 Checking for Overdispersion

A key assumption of Poisson regression is that the mean equals the variance. When the variance exceeds the mean (overdispersion), use negative binomial regression instead.

Code
# Dispersion test
dispersion_ratio <- sum(residuals(m_poisson, type = "pearson")^2) / m_poisson$df.residual
cat("Dispersion ratio:", round(dispersion_ratio, 3),
    "\n(Values >> 1 indicate overdispersion)\n")
#> Dispersion ratio: 0.97 
#> (Values >> 1 indicate overdispersion)

# If overdispersed, use negative binomial
# Namespaced call avoids attaching MASS, which masks dplyr::select()
m_negbin <- MASS::glm.nb(n_admissions ~ temperature + urban, data = hospital_data)
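Section 9.1 also listed rates (events per person-year) as Poisson with an offset. A minimal simulated sketch; `rate_data`, `person_years`, and `risk_score` are illustrative names, not part of the chapter's datasets:

```r
library(tibble)

set.seed(99)
rate_data <- tibble(
  person_years = runif(200, 100, 1000),   # exposure time per unit
  risk_score   = rnorm(200),              # a covariate
  events       = rpois(200, lambda = person_years * exp(-4 + 0.3 * risk_score))
)

# offset(log(person_years)) enters with its coefficient fixed at 1,
# so the remaining coefficients model the event *rate* per person-year
m_rate <- glm(events ~ risk_score + offset(log(person_years)),
              data = rate_data, family = poisson(link = "log"))

exp(coef(m_rate))   # exponentiated coefficients are rate ratios
```

The estimated coefficient on `risk_score` should recover roughly the 0.3 used in the simulation.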

9.4 Regression Table: All Models

Code
modelsummary(
  list("Logistic (Adoption)" = m_logit, "Poisson (Admissions)" = m_poisson),
  exponentiate = TRUE,
  stars        = TRUE,
  gof_map      = c("nobs", "AIC", "BIC"),
  coef_omit    = "month"
)
Table 9.1: Summary of GLM specifications. Logistic uses odds ratios; Poisson uses IRR.
                 Logistic (Adoption)   Poisson (Admissions)
(Intercept)      0.016***              17.088***
                 (0.009)               (2.121)
education_yrs    1.305***
                 (0.064)
income_lakh      1.146***
                 (0.039)
access_market    2.485**
                 (0.689)
temperature                            1.001
                                       (0.006)
urban                                  1.010
                                       (0.024)
Num.Obs.         300                   360
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

9.5 Exercises

  1. Using the Titanic dataset (or titanic package), fit a logistic regression predicting survival as a function of passenger class, sex, and age. Interpret the odds ratios. Who had the best survival odds?

  2. Simulate a count outcome (e.g., number of pest-infested fields) as a function of rainfall and temperature. Fit a Poisson regression and interpret the incidence rate ratios.

  3. Test your Poisson model from Exercise 2 for overdispersion. If overdispersion is present, fit a negative binomial model and compare AIC values.

  4. For a logistic regression model, explain in plain language the difference between: (a) a coefficient in log-odds, (b) an odds ratio, and (c) a predicted probability.

  5. Challenge: Using India’s National Family Health Survey (NFHS) data, model the binary outcome of child vaccination (full immunisation yes/no) as a function of mother’s education, wealth index, and place of residence (urban/rural). Create a coefficient plot and write a policy-relevant interpretation.