8  Linear Regression Models

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Fit and interpret simple and multiple linear regression models using lm()
  • Understand and evaluate R-squared and adjusted R-squared
  • Read regression diagnostic plots and identify problems
  • Check for multicollinearity using Variance Inflation Factors
  • Present clean regression tables using modelsummary

8.1 Simple Linear Regression

Simple linear regression models the relationship between one predictor (X) and one outcome (Y):

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)\]
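For simple regression the OLS estimates have a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the sample means. A quick sketch (not part of the original code) recomputes the estimates by hand:

```r
# Closed-form OLS estimates for simple regression:
#   beta1_hat = cov(x, y) / var(x)
#   beta0_hat = mean(y) - beta1_hat * mean(x)
x <- mtcars$wt
y <- mtcars$mpg
beta1_hat <- cov(x, y) / var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)
round(c(intercept = beta0_hat, slope = beta1_hat), 4)
#> intercept     slope
#>   37.2851   -5.3445
```

These match the `lm()` coefficients reported below exactly.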

Code
# Packages used throughout this chapter
library(ggplot2); library(broom)       # plotting; tidy model output
library(car); library(modelsummary)    # vif(); regression tables

data(mtcars)

# Research question: Does car weight predict fuel efficiency?
m_slr <- lm(mpg ~ wt, data = mtcars)
summary(m_slr)
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.5432 -2.3647 -0.1252  1.4096  6.8727 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
#> wt           -5.3445     0.5591  -9.559 1.29e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.046 on 30 degrees of freedom
#> Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
#> F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
Code
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 2.5, colour = "#2c3e50", alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, colour = "#e74c3c",
              fill = "#e74c3c", alpha = 0.15) +
  labs(
    title    = "Vehicle Weight vs. Fuel Efficiency",
    subtitle = sprintf("R² = %.3f, β₁ = %.2f",
                       summary(m_slr)$r.squared,
                       coef(m_slr)[2]),
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal(base_size = 13)
Figure 8.1: Simple linear regression: fuel efficiency decreases with vehicle weight.

8.1.1 Interpreting Coefficients

Code
coef(m_slr)
#> (Intercept)          wt 
#>   37.285126   -5.344472
confint(m_slr, level = 0.95)
#>                 2.5 %    97.5 %
#> (Intercept) 33.450500 41.119753
#> wt          -6.486308 -4.202635

Intercept (β₀ = 37.29): The predicted MPG when weight = 0 (not meaningful here — extrapolation).

Slope (β₁ = -5.34): For each additional 1000 lbs of weight, MPG decreases by 5.34 units on average.

R² = 0.753: 75.3% of the variation in MPG is explained by weight.
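R² is simply one minus the ratio of residual to total sums of squares. A short check (not in the original code) recovers the reported value by hand:

```r
# R-squared by hand: proportion of variance in Y explained by the model
m_slr <- lm(mpg ~ wt, data = mtcars)
rss <- sum(residuals(m_slr)^2)                 # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
1 - rss / tss
#> [1] 0.7528328
```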

8.1.2 Predictions

Code
# Predict MPG for new weight values
new_cars <- data.frame(wt = c(2.0, 3.0, 4.0, 5.0))
predict(m_slr, newdata = new_cars, interval = "confidence")   # CI for mean
#>        fit      lwr      upr
#> 1 26.59618 24.82389 28.36848
#> 2 21.25171 20.12444 22.37899
#> 3 15.90724 14.49018 17.32429
#> 4 10.56277  8.24913 12.87641
predict(m_slr, newdata = new_cars, interval = "prediction")   # PI for individual
#>        fit       lwr      upr
#> 1 26.59618 20.128114 33.06425
#> 2 21.25171 14.929874 27.57355
#> 3 15.90724  9.527355 22.28712
#> 4 10.56277  3.925916 17.19962
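Note that the prediction intervals are much wider than the confidence intervals: a prediction interval must also cover the residual variation of an individual new observation, not just the uncertainty in the estimated mean. A small sketch (hypothetical weight value) compares the widths directly:

```r
# PI vs CI width at wt = 3.0 (3000 lbs)
m_slr <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3.0)
ci <- predict(m_slr, newdata = new_car, interval = "confidence")
pi <- predict(m_slr, newdata = new_car, interval = "prediction")
round(c(ci_width = ci[1, "upr"] - ci[1, "lwr"],
        pi_width = pi[1, "upr"] - pi[1, "lwr"]), 3)
#> ci_width pi_width
#>    2.255   12.644
```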

8.2 Multiple Linear Regression

Multiple linear regression extends the model to k predictors:

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i\]
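In matrix form the OLS solution is \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). A sketch (not in the original code) computes it directly and matches `lm()`:

```r
# OLS in matrix form: beta_hat = (X'X)^{-1} X'y
# model.matrix() builds the design matrix, including dummy columns for factor(cyl)
X <- model.matrix(mpg ~ wt + hp + factor(cyl), data = mtcars)
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
drop(beta_hat)   # same values as coef(lm(mpg ~ wt + hp + factor(cyl), mtcars))
```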

Code
# Add horsepower and number of cylinders as predictors
m_mlr <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
summary(m_mlr)
#> 
#> Call:
#> lm(formula = mpg ~ wt + hp + factor(cyl), data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.2612 -1.0320 -0.3210  0.9281  5.3947 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  35.84600    2.04102  17.563 2.67e-16 ***
#> wt           -3.18140    0.71960  -4.421 0.000144 ***
#> hp           -0.02312    0.01195  -1.934 0.063613 .  
#> factor(cyl)6 -3.35902    1.40167  -2.396 0.023747 *  
#> factor(cyl)8 -3.18588    2.17048  -1.468 0.153705    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.44 on 27 degrees of freedom
#> Multiple R-squared:  0.8572, Adjusted R-squared:  0.8361 
#> F-statistic: 40.53 on 4 and 27 DF,  p-value: 4.869e-11

8.2.1 Comparing Models

Code
# Nested model comparison with F-test
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

anova(m1, m2, m3)
#> Analysis of Variance Table
#> 
#> Model 1: mpg ~ wt
#> Model 2: mpg ~ wt + hp
#> Model 3: mpg ~ wt + hp + factor(cyl)
#>   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
#> 1     30 278.32                                   
#> 2     29 195.05  1    83.274 13.9846 0.0008777 ***
#> 3     27 160.78  2    34.270  2.8776 0.0736450 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# AIC and BIC (lower = better)
AIC(m1, m2, m3)
#>    df      AIC
#> m1  3 166.0294
#> m2  4 156.6523
#> m3  6 154.4692
BIC(m1, m2, m3)
#>    df      BIC
#> m1  3 170.4266
#> m2  4 162.5153
#> m3  6 163.2636
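AIC trades off fit against complexity: AIC = -2·logLik + 2k, where k counts all estimated parameters (the coefficients plus the residual standard deviation). A quick sketch (not in the original code) recovers the value for m3 by hand:

```r
# AIC by hand: -2 * log-likelihood + 2 * number of estimated parameters
m3 <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
k  <- attr(logLik(m3), "df")   # 5 coefficients + sigma = 6
-2 * as.numeric(logLik(m3)) + 2 * k
#> [1] 154.4692
```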

8.2.2 Regression Tables with modelsummary

Code
modelsummary(
  list("Model 1" = m1, "Model 2" = m2, "Model 3" = m3),
  stars      = TRUE,
  gof_map    = c("nobs", "r.squared", "adj.r.squared", "AIC"),
  coef_rename = c(
    "(Intercept)"   = "Intercept",
    "wt"            = "Weight",
    "hp"            = "Horsepower",
    "factor(cyl)6"  = "6 Cylinders",
    "factor(cyl)8"  = "8 Cylinders"
  ),
  notes = "Standard errors in parentheses."
)
Table 8.1: Regression models predicting fuel efficiency (MPG).
              Model 1     Model 2     Model 3
Intercept     37.285***   37.227***   35.846***
              (1.878)     (1.599)     (2.041)
Weight        -5.344***   -3.878***   -3.181***
              (0.559)     (0.633)     (0.720)
Horsepower                -0.032**    -0.023+
                          (0.009)     (0.012)
6 Cylinders                           -3.359*
                                      (1.402)
8 Cylinders                           -3.186
                                      (2.170)
Num.Obs.      32          32          32
R2            0.753       0.827       0.857
R2 Adj.       0.745       0.815       0.836
AIC           166.0       156.7       154.5
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors in parentheses.

8.3 Regression Diagnostics

Linear regression relies on four key assumptions, summarised by the acronym LINE: Linearity of the relationship, Independence of the errors, Normality of the residuals, and Equal variance (homoscedasticity).

Code
par(mfrow = c(2, 2))
plot(m_mlr, which = 1:4)
par(mfrow = c(1, 1))
Figure 8.2: Regression diagnostic plots for the multiple regression model.
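The visual checks above can be supplemented with formal tests. A sketch using only base R (the Breusch-Pagan statistic here is the Koenker n·R² version, computed by hand rather than via a dedicated package):

```r
m_mlr <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Shapiro-Wilk: H0 = residuals are normally distributed
shapiro.test(residuals(m_mlr))

# Breusch-Pagan (Koenker version) by hand: regress squared residuals on the
# predictors; under H0 (constant variance), n * R^2 ~ chi-squared with k df
aux <- lm(residuals(m_mlr)^2 ~ wt + hp + factor(cyl), data = mtcars)
bp  <- nrow(mtcars) * summary(aux)$r.squared
pchisq(bp, df = 4, lower.tail = FALSE)   # p-value; small values flag heteroscedasticity
```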

8.3.1 Residuals vs. Fitted

Code
augmented <- augment(m_mlr)

p1 <- ggplot(augmented, aes(.fitted, .resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "red") +
  geom_smooth(se = FALSE, colour = "#2980b9") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs. Fitted") +
  theme_minimal()

p2 <- ggplot(augmented, aes(sample = .std.resid)) +
  stat_qq(alpha = 0.7) +
  stat_qq_line(colour = "red") +
  labs(x = "Theoretical Quantiles", y = "Standardised Residuals",
       title = "Normal Q-Q Plot") +
  theme_minimal()

library(patchwork)
p1 | p2
Figure 8.3: Residual plots using ggplot2 via broom.

8.3.2 Checking for Multicollinearity

Multicollinearity occurs when predictors are highly correlated with one another. It inflates standard errors and makes individual coefficient estimates unstable and hard to interpret.

Code
# Variance Inflation Factors
vif(m_mlr)
#>                 GVIF Df GVIF^(1/(2*Df))
#> wt          2.580877  1        1.606511
#> hp          3.496014  1        1.869763
#> factor(cyl) 5.105811  2        1.503198

# Rule of thumb: VIF > 5 (or 10) indicates a problem
# GVIF^(1/(2*Df)) for factors
Tip: VIF Interpretation
  • VIF = 1: No multicollinearity
  • VIF 1–5: Moderate (usually acceptable)
  • VIF > 5: High (investigate)
  • VIF > 10: Severe (consider dropping or combining variables)
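The VIF for a predictor is 1/(1 - R²) from regressing that predictor on all the others, which is why a VIF of 5 corresponds to 80% of a predictor's variance being explained by its colleagues. A sketch (not in the original code) recovers the VIF for weight by hand:

```r
# VIF for wt by hand: regress wt on the other predictors, then 1 / (1 - R^2)
r2_wt <- summary(lm(wt ~ hp + factor(cyl), data = mtcars))$r.squared
1 / (1 - r2_wt)
#> [1] 2.580877
```

This matches the GVIF reported for wt above (for terms with Df = 1, GVIF equals the ordinary VIF).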

8.3.3 Influential Observations

Code
augmented |>
  mutate(obs = row_number()) |>
  ggplot(aes(x = obs, y = .cooksd)) +
  geom_col(fill = "#e74c3c", alpha = 0.7) +
  geom_hline(yintercept = 4 / nrow(mtcars),
             linetype = "dashed", colour = "grey40") +
  labs(title = "Cook's Distance",
       subtitle = "Dashed line = 4/n threshold",
       x = "Observation", y = "Cook's Distance") +
  theme_minimal(base_size = 12)

# Which observations have high influence?
augmented |>
  arrange(desc(.cooksd)) |>
  select(.rownames, mpg, wt, hp, .fitted, .cooksd) |>
  head(5)
#> # A tibble: 5 × 6
#>   .rownames           mpg    wt    hp .fitted .cooksd
#>   <chr>             <dbl> <dbl> <dbl>   <dbl>   <dbl>
#> 1 Chrysler Imperial  14.7  5.34   230    10.3  0.262 
#> 2 Toyota Corolla     33.9  1.84    65    28.5  0.145 
#> 3 Maserati Bora      15    3.57   335    13.6  0.117 
#> 4 Fiat 128           32.4  2.2     66    27.3  0.105 
#> 5 AMC Javelin        15.2  3.44   150    18.3  0.0855
Figure 8.4: Cook’s distance identifies observations with high influence on the model.
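Cook's distance combines the size of an observation's residual with its leverage: \(D_i = (r_i^2 / p) \cdot h_i / (1 - h_i)\), where \(r_i\) is the standardised residual, \(h_i\) the leverage, and \(p\) the number of coefficients. A sketch (not in the original code) verifies this against R's built-in:

```r
# Cook's distance by hand: large D_i needs both a large residual AND high leverage
m_mlr <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
p <- length(coef(m_mlr))        # number of estimated coefficients
h <- hatvalues(m_mlr)           # leverage of each observation
d_manual <- (rstandard(m_mlr)^2 / p) * h / (1 - h)
all.equal(d_manual, cooks.distance(m_mlr))
#> [1] TRUE
```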

8.4 Using broom for Tidy Model Output

Code
# Tidy coefficient table
tidy(m_mlr, conf.int = TRUE)
#> # A tibble: 5 × 7
#>   term         estimate std.error statistic  p.value conf.low conf.high
#>   <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept)   35.8       2.04       17.6  2.67e-16  31.7     40.0    
#> 2 wt            -3.18      0.720      -4.42 1.44e- 4  -4.66    -1.70   
#> 3 hp            -0.0231    0.0120     -1.93 6.36e- 2  -0.0476   0.00140
#> 4 factor(cyl)6  -3.36      1.40       -2.40 2.37e- 2  -6.24    -0.483  
#> 5 factor(cyl)8  -3.19      2.17       -1.47 1.54e- 1  -7.64     1.27

# Model-level statistics
glance(m_mlr)
#> # A tibble: 1 × 12
#>   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0.857         0.836  2.44      40.5 4.87e-11     4  -71.2  154.  163.
#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# Observation-level statistics (fitted, residuals, leverage, etc.)
augment(m_mlr) |> head(5)
#> # A tibble: 5 × 11
#>   .rownames   mpg    wt    hp `factor(cyl)` .fitted .resid   .hat .sigma .cooksd
#>   <chr>     <dbl> <dbl> <dbl> <fct>           <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1 Mazda RX4  21    2.62   110 6                21.6 -0.609 0.168    2.48 3.01e-3
#> 2 Mazda RX…  21    2.88   110 6                20.8  0.203 0.151    2.49 2.90e-4
#> 3 Datsun 7…  22.8  2.32    93 4                26.3 -3.51  0.0936   2.38 4.73e-2
#> 4 Hornet 4…  21.4  3.22   110 6                19.7  1.68  0.147    2.46 1.93e-2
#> 5 Hornet S…  18.7  3.44   175 8                17.7  1.03  0.126    2.48 5.86e-3
#> # ℹ 1 more variable: .std.resid <dbl>

8.5 Exercises

  1. Using the mtcars dataset, regress qsec (quarter-mile time) on hp and wt. Interpret each coefficient and the R-squared value. Are the coefficients statistically significant?

  2. Fit three nested regression models predicting Ozone in the airquality dataset: (a) Solar.R only, (b) Solar.R + Wind, (c) Solar.R + Wind + Temp. Compare them using anova() and AIC. Which model is best?

  3. Conduct a full diagnostic analysis for model (c) from Exercise 2. Are the assumptions satisfied? If not, what would you do?

  4. Add an interaction term to your model: Wind * Temp. Is the interaction significant? Interpret what it means.

  5. Challenge: Download India district-level crop yield data and build a multiple regression model predicting wheat yield from rainfall, fertiliser use, and irrigated area. Present your model in a clean modelsummary table and write a brief interpretation of each coefficient.