8  Linear Regression Models

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Fit and interpret simple and multiple linear regression models using lm()
  • Understand and evaluate R-squared and adjusted R-squared
  • Read regression diagnostic plots and identify problems
  • Check for multicollinearity using Variance Inflation Factors
  • Present clean regression tables using modelsummary

8.1 Simple Linear Regression

Simple linear regression models the relationship between one predictor (X) and one outcome (Y):

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)\]
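For simple regression the OLS estimates have a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the sample means. A quick sketch (not part of the original code) recomputes the estimates by hand:

```r
# Closed-form OLS estimates for simple regression:
#   beta1_hat = cov(x, y) / var(x)
#   beta0_hat = mean(y) - beta1_hat * mean(x)
x <- mtcars$wt
y <- mtcars$mpg
beta1_hat <- cov(x, y) / var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)
round(c(intercept = beta0_hat, slope = beta1_hat), 4)
#> intercept     slope
#>   37.2851   -5.3445
```

These match the `lm()` coefficients reported below exactly.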

Code
# Packages used throughout this chapter
library(ggplot2); library(broom)       # plotting; tidy model output
library(car); library(modelsummary)    # vif(); regression tables

data(mtcars)

# Research question: Does car weight predict fuel efficiency?
m_slr <- lm(mpg ~ wt, data = mtcars)
summary(m_slr)
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.5432 -2.3647 -0.1252  1.4096  6.8727 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
#> wt           -5.3445     0.5591  -9.559 1.29e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.046 on 30 degrees of freedom
#> Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
#> F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
Code
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 2.5, colour = "#2c3e50", alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, colour = "#e74c3c",
              fill = "#e74c3c", alpha = 0.15) +
  labs(
    title    = "Vehicle Weight vs. Fuel Efficiency",
    subtitle = sprintf("R² = %.3f, β₁ = %.2f",
                       summary(m_slr)$r.squared,
                       coef(m_slr)[2]),
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal(base_size = 13)
Figure 8.1: Simple linear regression: fuel efficiency decreases with vehicle weight.

8.1.1 Interpreting Coefficients

Code
coef(m_slr)
#> (Intercept)          wt 
#>   37.285126   -5.344472
confint(m_slr, level = 0.95)
#>                 2.5 %    97.5 %
#> (Intercept) 33.450500 41.119753
#> wt          -6.486308 -4.202635

Intercept (β₀ = 37.29): The predicted MPG when weight = 0 (not meaningful here — extrapolation).

Slope (β₁ = -5.34): For each additional 1000 lbs of weight, MPG decreases by 5.34 units on average.

R² = 0.753: 75.3% of the variation in MPG is explained by weight.
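R² is simply one minus the ratio of residual to total sums of squares. A short check (not in the original code) recovers the reported value by hand:

```r
# R-squared by hand: proportion of variance in Y explained by the model
m_slr <- lm(mpg ~ wt, data = mtcars)
rss <- sum(residuals(m_slr)^2)                 # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
1 - rss / tss
#> [1] 0.7528328
```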

8.1.2 Predictions

Code
# Predict MPG for new weight values
new_cars <- data.frame(wt = c(2.0, 3.0, 4.0, 5.0))
predict(m_slr, newdata = new_cars, interval = "confidence")   # CI for mean
#>        fit      lwr      upr
#> 1 26.59618 24.82389 28.36848
#> 2 21.25171 20.12444 22.37899
#> 3 15.90724 14.49018 17.32429
#> 4 10.56277  8.24913 12.87641
predict(m_slr, newdata = new_cars, interval = "prediction")   # PI for individual
#>        fit       lwr      upr
#> 1 26.59618 20.128114 33.06425
#> 2 21.25171 14.929874 27.57355
#> 3 15.90724  9.527355 22.28712
#> 4 10.56277  3.925916 17.19962
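Note that the prediction intervals are much wider than the confidence intervals: a prediction interval must also cover the residual variation of an individual new observation, not just the uncertainty in the estimated mean. A small sketch (hypothetical weight value) compares the widths directly:

```r
# PI vs CI width at wt = 3.0 (3000 lbs)
m_slr <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3.0)
ci <- predict(m_slr, newdata = new_car, interval = "confidence")
pi <- predict(m_slr, newdata = new_car, interval = "prediction")
round(c(ci_width = ci[1, "upr"] - ci[1, "lwr"],
        pi_width = pi[1, "upr"] - pi[1, "lwr"]), 3)
#> ci_width pi_width
#>    2.255   12.644
```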

8.2 Multiple Linear Regression

Multiple linear regression extends the model to k predictors:

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i\]
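In matrix form the OLS solution is \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). A sketch (not in the original code) computes it directly and matches `lm()`:

```r
# OLS in matrix form: beta_hat = (X'X)^{-1} X'y
# model.matrix() builds the design matrix, including dummy columns for factor(cyl)
X <- model.matrix(mpg ~ wt + hp + factor(cyl), data = mtcars)
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
drop(beta_hat)   # same values as coef(lm(mpg ~ wt + hp + factor(cyl), mtcars))
```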

Code
# Add horsepower and number of cylinders as predictors
m_mlr <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
summary(m_mlr)
#> 
#> Call:
#> lm(formula = mpg ~ wt + hp + factor(cyl), data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.2612 -1.0320 -0.3210  0.9281  5.3947 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  35.84600    2.04102  17.563 2.67e-16 ***
#> wt           -3.18140    0.71960  -4.421 0.000144 ***
#> hp           -0.02312    0.01195  -1.934 0.063613 .  
#> factor(cyl)6 -3.35902    1.40167  -2.396 0.023747 *  
#> factor(cyl)8 -3.18588    2.17048  -1.468 0.153705    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.44 on 27 degrees of freedom
#> Multiple R-squared:  0.8572, Adjusted R-squared:  0.8361 
#> F-statistic: 40.53 on 4 and 27 DF,  p-value: 4.869e-11

8.2.1 Comparing Models

Code
# Nested model comparison with F-test
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

anova(m1, m2, m3)
#> Analysis of Variance Table
#> 
#> Model 1: mpg ~ wt
#> Model 2: mpg ~ wt + hp
#> Model 3: mpg ~ wt + hp + factor(cyl)
#>   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
#> 1     30 278.32                                   
#> 2     29 195.05  1    83.274 13.9846 0.0008777 ***
#> 3     27 160.78  2    34.270  2.8776 0.0736450 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# AIC and BIC (lower = better)
AIC(m1, m2, m3)
#>    df      AIC
#> m1  3 166.0294
#> m2  4 156.6523
#> m3  6 154.4692
BIC(m1, m2, m3)
#>    df      BIC
#> m1  3 170.4266
#> m2  4 162.5153
#> m3  6 163.2636
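AIC trades off fit against complexity: AIC = -2·logLik + 2k, where k counts all estimated parameters (the coefficients plus the residual standard deviation). A quick sketch (not in the original code) recovers the value for m3 by hand:

```r
# AIC by hand: -2 * log-likelihood + 2 * number of estimated parameters
m3 <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
k  <- attr(logLik(m3), "df")   # 5 coefficients + sigma = 6
-2 * as.numeric(logLik(m3)) + 2 * k
#> [1] 154.4692
```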

8.2.2 Regression Tables with modelsummary

Code
modelsummary(
  list("Model 1" = m1, "Model 2" = m2, "Model 3" = m3),
  stars      = TRUE,
  gof_map    = c("nobs", "r.squared", "adj.r.squared", "AIC"),
  coef_rename = c(
    "(Intercept)"   = "Intercept",
    "wt"            = "Weight",
    "hp"            = "Horsepower",
    "factor(cyl)6"  = "6 Cylinders",
    "factor(cyl)8"  = "8 Cylinders"
  ),
  notes = "Standard errors in parentheses."
)
Table 8.1: Regression models predicting fuel efficiency (MPG).
              Model 1     Model 2     Model 3
Intercept     37.285***   37.227***   35.846***
              (1.878)     (1.599)     (2.041)
Weight        -5.344***   -3.878***   -3.181***
              (0.559)     (0.633)     (0.720)
Horsepower                -0.032**    -0.023+
                          (0.009)     (0.012)
6 Cylinders                           -3.359*
                                      (1.402)
8 Cylinders                           -3.186
                                      (2.170)
Num.Obs.      32          32          32
R2            0.753       0.827       0.857
R2 Adj.       0.745       0.815       0.836
AIC           166.0       156.7       154.5
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors in parentheses.

8.3 Regression Diagnostics

Linear regression relies on four key assumptions, summarised by the acronym LINE: Linearity of the relationship, Independence of the errors, Normality of the residuals, and Equal variance (homoscedasticity).

Code
par(mfrow = c(2, 2))
plot(m_mlr, which = 1:4)
par(mfrow = c(1, 1))
Figure 8.2: Regression diagnostic plots for the multiple regression model.
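The visual checks above can be supplemented with formal tests. A sketch using only base R (the Breusch-Pagan statistic here is the Koenker n·R² version, computed by hand rather than via a dedicated package):

```r
m_mlr <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Shapiro-Wilk: H0 = residuals are normally distributed
shapiro.test(residuals(m_mlr))

# Breusch-Pagan (Koenker version) by hand: regress squared residuals on the
# predictors; under H0 (constant variance), n * R^2 ~ chi-squared with k df
aux <- lm(residuals(m_mlr)^2 ~ wt + hp + factor(cyl), data = mtcars)
bp  <- nrow(mtcars) * summary(aux)$r.squared
pchisq(bp, df = 4, lower.tail = FALSE)   # p-value; small values flag heteroscedasticity
```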

8.3.1 Residuals vs. Fitted

Code
augmented <- augment(m_mlr)

p1 <- ggplot(augmented, aes(.fitted, .resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "red") +
  geom_smooth(se = FALSE, colour = "#2980b9") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs. Fitted") +
  theme_minimal()

p2 <- ggplot(augmented, aes(sample = .std.resid)) +
  stat_qq(alpha = 0.7) +
  stat_qq_line(colour = "red") +
  labs(x = "Theoretical Quantiles", y = "Standardised Residuals",
       title = "Normal Q-Q Plot") +
  theme_minimal()

library(patchwork)
p1 | p2
Figure 8.3: Residual plots using ggplot2 via broom.

8.3.2 Checking for Multicollinearity

Multicollinearity occurs when predictors are highly correlated with one another. It inflates standard errors and makes individual coefficient estimates unstable and hard to interpret.

Code
# Variance Inflation Factors
vif(m_mlr)
#>                 GVIF Df GVIF^(1/(2*Df))
#> wt          2.580877  1        1.606511
#> hp          3.496014  1        1.869763
#> factor(cyl) 5.105811  2        1.503198

# Rule of thumb: VIF > 5 (or 10) indicates a problem
# GVIF^(1/(2*Df)) for factors
Tip: VIF Interpretation
  • VIF = 1: No multicollinearity
  • VIF 1–5: Moderate (usually acceptable)
  • VIF > 5: High (investigate)
  • VIF > 10: Severe (consider dropping or combining variables)
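The VIF for a predictor is 1/(1 - R²) from regressing that predictor on all the others, which is why a VIF of 5 corresponds to 80% of a predictor's variance being explained by its colleagues. A sketch (not in the original code) recovers the VIF for weight by hand:

```r
# VIF for wt by hand: regress wt on the other predictors, then 1 / (1 - R^2)
r2_wt <- summary(lm(wt ~ hp + factor(cyl), data = mtcars))$r.squared
1 / (1 - r2_wt)
#> [1] 2.580877
```

This matches the GVIF reported for wt above (for terms with Df = 1, GVIF equals the ordinary VIF).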

8.3.3 Influential Observations

Code
augmented |>
  mutate(obs = row_number()) |>
  ggplot(aes(x = obs, y = .cooksd)) +
  geom_col(fill = "#e74c3c", alpha = 0.7) +
  geom_hline(yintercept = 4 / nrow(mtcars),
             linetype = "dashed", colour = "grey40") +
  labs(title = "Cook's Distance",
       subtitle = "Dashed line = 4/n threshold",
       x = "Observation", y = "Cook's Distance") +
  theme_minimal(base_size = 12)

# Which observations have high influence?
augmented |>
  arrange(desc(.cooksd)) |>
  select(.rownames, mpg, wt, hp, .fitted, .cooksd) |>
  head(5)
#> # A tibble: 5 × 6
#>   .rownames           mpg    wt    hp .fitted .cooksd
#>   <chr>             <dbl> <dbl> <dbl>   <dbl>   <dbl>
#> 1 Chrysler Imperial  14.7  5.34   230    10.3  0.262 
#> 2 Toyota Corolla     33.9  1.84    65    28.5  0.145 
#> 3 Maserati Bora      15    3.57   335    13.6  0.117 
#> 4 Fiat 128           32.4  2.2     66    27.3  0.105 
#> 5 AMC Javelin        15.2  3.44   150    18.3  0.0855
Figure 8.4: Cook’s distance identifies observations with high influence on the model.
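Cook's distance combines the size of an observation's residual with its leverage: \(D_i = (r_i^2 / p) \cdot h_i / (1 - h_i)\), where \(r_i\) is the standardised residual, \(h_i\) the leverage, and \(p\) the number of coefficients. A sketch (not in the original code) verifies this against R's built-in:

```r
# Cook's distance by hand: large D_i needs both a large residual AND high leverage
m_mlr <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
p <- length(coef(m_mlr))        # number of estimated coefficients
h <- hatvalues(m_mlr)           # leverage of each observation
d_manual <- (rstandard(m_mlr)^2 / p) * h / (1 - h)
all.equal(d_manual, cooks.distance(m_mlr))
#> [1] TRUE
```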

8.4 Using broom for Tidy Model Output

Code
# Tidy coefficient table
tidy(m_mlr, conf.int = TRUE)
#> # A tibble: 5 × 7
#>   term         estimate std.error statistic  p.value conf.low conf.high
#>   <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept)   35.8       2.04       17.6  2.67e-16  31.7     40.0    
#> 2 wt            -3.18      0.720      -4.42 1.44e- 4  -4.66    -1.70   
#> 3 hp            -0.0231    0.0120     -1.93 6.36e- 2  -0.0476   0.00140
#> 4 factor(cyl)6  -3.36      1.40       -2.40 2.37e- 2  -6.24    -0.483  
#> 5 factor(cyl)8  -3.19      2.17       -1.47 1.54e- 1  -7.64     1.27

# Model-level statistics
glance(m_mlr)
#> # A tibble: 1 × 12
#>   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0.857         0.836  2.44      40.5 4.87e-11     4  -71.2  154.  163.
#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# Observation-level statistics (fitted, residuals, leverage, etc.)
augment(m_mlr) |> head(5)
#> # A tibble: 5 × 11
#>   .rownames   mpg    wt    hp `factor(cyl)` .fitted .resid   .hat .sigma .cooksd
#>   <chr>     <dbl> <dbl> <dbl> <fct>           <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1 Mazda RX4  21    2.62   110 6                21.6 -0.609 0.168    2.48 3.01e-3
#> 2 Mazda RX…  21    2.88   110 6                20.8  0.203 0.151    2.49 2.90e-4
#> 3 Datsun 7…  22.8  2.32    93 4                26.3 -3.51  0.0936   2.38 4.73e-2
#> 4 Hornet 4…  21.4  3.22   110 6                19.7  1.68  0.147    2.46 1.93e-2
#> 5 Hornet S…  18.7  3.44   175 8                17.7  1.03  0.126    2.48 5.86e-3
#> # ℹ 1 more variable: .std.resid <dbl>

8.5 Exercises

  1. Using the mtcars dataset, regress qsec (quarter-mile time) on hp and wt. Interpret each coefficient and the R-squared value. Are the coefficients statistically significant?

  2. Fit three nested regression models predicting Ozone in the airquality dataset: (a) Solar.R only, (b) Solar.R + Wind, (c) Solar.R + Wind + Temp. Compare them using anova() and AIC. Which model is best?

  3. Conduct a full diagnostic analysis for model (c) from Exercise 2. Are the assumptions satisfied? If not, what would you do?

  4. Add an interaction term to your model: Wind * Temp. Is the interaction significant? Interpret what it means.

  5. Challenge: Download India district-level crop yield data and build a multiple regression model predicting wheat yield from rainfall, fertiliser use, and irrigated area. Present your model in a clean modelsummary table and write a brief interpretation of each coefficient.