7  Hypothesis Testing

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • State the logic and vocabulary of null hypothesis significance testing
  • Perform one-sample, two-sample, and paired t-tests in R
  • Apply non-parametric alternatives when assumptions are violated
  • Conduct chi-squared tests for categorical data
  • Interpret p-values correctly and avoid common misinterpretations

7.1 The Logic of Hypothesis Testing

Null Hypothesis Significance Testing (NHST) follows a specific logic:

  1. State the null hypothesis H₀ (the “nothing going on” hypothesis)
  2. State the alternative hypothesis H₁ (what you expect to find)
  3. Choose a significance level α (typically 0.05)
  4. Compute a test statistic from the data
  5. Compute the p-value: P(data at least this extreme | H₀ is true)
  6. If p < α, reject H₀; otherwise, fail to reject H₀
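The six steps above can be carried out by hand in R. A minimal sketch using a hypothetical one-sample setting (the data and μ₀ = 50 are invented for illustration):

```r
# Hypothetical sample: does its mean differ from mu0 = 50?
x     <- c(52.1, 49.8, 53.4, 51.2, 50.9, 54.0, 48.7, 52.8)
mu0   <- 50      # Step 1: H0: mu = 50 (H1: mu != 50, two-tailed)
alpha <- 0.05    # Step 3: significance level

# Step 4: test statistic
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# Step 5: two-tailed p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = length(x) - 1)

# Step 6: decision
if (p_val < alpha) "reject H0" else "fail to reject H0"

# Sanity check: agrees with R's built-in test
all.equal(p_val, t.test(x, mu = mu0)$p.value)
```

The point of the manual version is only to make the recipe concrete; in practice you would call t.test() directly, as in the rest of this chapter.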

Important: What a p-value Is NOT
  • It is not the probability that H₀ is true
  • It is not the probability that your result occurred by chance
  • A p-value > 0.05 does not mean H₀ is true
  • A small p-value does not mean the effect is practically important

A p-value is: P(observing data at least this extreme, assuming H₀ is true).
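A consequence of this definition is that when H₀ is true, p-values are uniformly distributed, so about 5% fall below 0.05 by chance alone. A quick simulation (seed chosen arbitrarily):

```r
set.seed(1)  # arbitrary seed for reproducibility

# 10,000 one-sample t-tests on data where H0 (mu = 0) is exactly true
p_vals <- replicate(10000, t.test(rnorm(30), mu = 0)$p.value)

# Proportion of "significant" results: close to alpha = 0.05
mean(p_vals < 0.05)
```

This is the false-positive rate built into NHST: even with nothing going on, roughly 1 test in 20 will cross the 0.05 threshold.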

7.2 Parametric Tests

7.2.1 One-Sample t-test

Question: Is the mean yield of our wheat variety significantly different from the national average of 4500 kg/ha?

Code
set.seed(42)
farm_yields <- rnorm(35, mean = 4800, sd = 680)

# H₀: μ = 4500
# H₁: μ ≠ 4500 (two-tailed)
result <- t.test(farm_yields, mu = 4500, alternative = "two.sided")
print(result)
#> 
#>  One Sample t-test
#> 
#> data:  farm_yields
#> t = 2.7948, df = 34, p-value = 0.008474
#> alternative hypothesis: true mean is not equal to 4500
#> 95 percent confidence interval:
#>  4603.852 5157.368
#> sample estimates:
#> mean of x 
#>   4880.61

# One-tailed: test if our variety EXCEEDS the average
t.test(farm_yields, mu = 4500, alternative = "greater")
#> 
#>  One Sample t-test
#> 
#> data:  farm_yields
#> t = 2.7948, df = 34, p-value = 0.004237
#> alternative hypothesis: true mean is greater than 4500
#> 95 percent confidence interval:
#>  4650.334      Inf
#> sample estimates:
#> mean of x 
#>   4880.61

7.2.2 Two-Sample t-test (Independent)

Question: Do two varieties (A and B) have different mean yields?

Code
variety_A <- rnorm(30, mean = 4700, sd = 600)
variety_B <- rnorm(30, mean = 4400, sd = 650)

# First: check if variances are equal with an F-test
# (For non-normal data, car::leveneTest() is a more robust alternative)
var.test(variety_A, variety_B)  # F-test for equal variances
#> 
#>  F test to compare two variances
#> 
#> data:  variety_A and variety_B
#> F = 1.3595, num df = 29, denom df = 29, p-value = 0.4132
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#>  0.6470874 2.8563626
#> sample estimates:
#> ratio of variances 
#>           1.359528

# Welch t-test (does not assume equal variances — default in R)
t.test(variety_A, variety_B, var.equal = FALSE)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  variety_A and variety_B
#> t = 0.43142, df = 56.684, p-value = 0.6678
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -239.2662  370.6546
#> sample estimates:
#> mean of x mean of y 
#>  4606.472  4540.778

# Student t-test (assumes equal variances)
t.test(variety_A, variety_B, var.equal = TRUE)
#> 
#>  Two Sample t-test
#> 
#> data:  variety_A and variety_B
#> t = 0.43142, df = 58, p-value = 0.6678
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -239.1154  370.5039
#> sample estimates:
#> mean of x mean of y 
#>  4606.472  4540.778
Code
tibble(
  yield   = c(variety_A, variety_B),
  variety = rep(c("Variety A", "Variety B"), each = 30)
) |>
  ggplot(aes(x = variety, y = yield, fill = variety)) +
  geom_boxplot(alpha = 0.7, width = 0.5) +
  geom_jitter(width = 0.15, alpha = 0.4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Yield Comparison: Variety A vs. Variety B",
       x = NULL, y = "Yield (kg/ha)") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")
Figure 7.1: Distribution of yields for Variety A and Variety B.

7.2.3 Paired t-test

Use when observations come in pairs — e.g., measuring the same field before and after treatment.

Code
set.seed(10)
before_treatment <- rnorm(25, mean = 3200, sd = 400)
after_treatment  <- before_treatment + rnorm(25, mean = 350, sd = 200)

# Paired t-test
t.test(after_treatment, before_treatment, paired = TRUE)
#> 
#>  Paired t-test
#> 
#> data:  after_treatment and before_treatment
#> t = 8.6592, df = 24, p-value = 7.556e-09
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#>  211.6576 344.1281
#> sample estimates:
#> mean difference 
#>        277.8928

# Equivalent to one-sample t-test on the differences
differences <- after_treatment - before_treatment
t.test(differences, mu = 0)
#> 
#>  One Sample t-test
#> 
#> data:  differences
#> t = 8.6592, df = 24, p-value = 7.556e-09
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>  211.6576 344.1281
#> sample estimates:
#> mean of x 
#>  277.8928

7.3 Non-Parametric Tests

Non-parametric tests make fewer distributional assumptions and are appropriate when:

  • The data are clearly non-normal and n is small
  • Data are ordinal (ranked)
  • Outliers are present and cannot be removed
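A common workflow is to check normality first and fall back on a rank-based test when it fails. A sketch with simulated right-skewed data (seed and variable names are arbitrary):

```r
set.seed(7)                      # arbitrary seed
skewed <- rexp(40, rate = 1/5)   # clearly right-skewed data

# Shapiro-Wilk test: H0 = data come from a normal distribution
sw <- shapiro.test(skewed)
sw$p.value   # small p-value suggests non-normality

# Choose the test based on the normality check
if (sw$p.value < 0.05) {
  wilcox.test(skewed, mu = 5)    # rank-based alternative
} else {
  t.test(skewed, mu = 5)
}
```

Note that with large n, shapiro.test() flags even trivial departures from normality, so treat it as a guide alongside a histogram or Q-Q plot, not as a mechanical gatekeeper.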

7.3.1 Mann-Whitney U Test (Wilcoxon Rank-Sum)

The non-parametric alternative to the independent two-sample t-test:

Code
# Simulate non-normal data
group_1 <- rexp(25, rate = 1/5)   # Right-skewed
group_2 <- rexp(25, rate = 1/7)   # Different rate

wilcox.test(group_1, group_2, alternative = "two.sided")
#> 
#>  Wilcoxon rank sum exact test
#> 
#> data:  group_1 and group_2
#> W = 255, p-value = 0.2714
#> alternative hypothesis: true location shift is not equal to 0

7.3.2 Wilcoxon Signed-Rank Test

The non-parametric alternative to the paired t-test:

Code
wilcox.test(after_treatment, before_treatment, paired = TRUE)
#> 
#>  Wilcoxon signed rank exact test
#> 
#> data:  after_treatment and before_treatment
#> V = 322, p-value = 2.98e-07
#> alternative hypothesis: true location shift is not equal to 0

7.3.3 Kruskal-Wallis Test

The non-parametric alternative to one-way ANOVA (for comparing 3+ groups):

Code
soil_pH <- tibble(
  pH    = c(rnorm(20, 6.2, 0.4), rnorm(20, 6.8, 0.5), rnorm(20, 7.1, 0.3)),
  group = rep(c("Sandy", "Loam", "Clay"), each = 20)
)

kruskal.test(pH ~ group, data = soil_pH)
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  pH by group
#> Kruskal-Wallis chi-squared = 30.744, df = 2, p-value = 2.109e-07

# Post-hoc: pairwise Wilcoxon with correction
pairwise.wilcox.test(soil_pH$pH, soil_pH$group, p.adjust.method = "BH")
#> 
#>  Pairwise comparisons using Wilcoxon rank sum exact test 
#> 
#> data:  soil_pH$pH and soil_pH$group 
#> 
#>       Clay    Loam   
#> Loam  0.0061  -      
#> Sandy 1.2e-07 3.7e-05
#> 
#> P value adjustment method: BH

7.4 Chi-Squared Tests

7.4.1 Goodness of Fit

Tests whether observed frequencies match expected frequencies:

Code
# Observed die rolls
observed <- c(15, 18, 12, 20, 17, 18)   # 100 rolls, 6 sides
expected_probs <- rep(1/6, 6)             # Expected under fair die

chisq.test(observed, p = expected_probs)
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  observed
#> X-squared = 2.36, df = 5, p-value = 0.7974

7.4.2 Test of Independence

Tests whether two categorical variables are independent:

Code
# Contingency table: crop adoption by state
adoption_table <- matrix(
  c(45, 30, 25, 60, 20, 40, 35, 55, 10),
  nrow = 3,
  dimnames = list(
    State = c("Punjab", "Haryana", "UP"),
    Adoption = c("Improved Seeds", "Traditional", "Mixed")
  )
)

print(adoption_table)
#>          Adoption
#> State     Improved Seeds Traditional Mixed
#>   Punjab              45          60    35
#>   Haryana             30          20    55
#>   UP                  25          40    10
result_chi <- chisq.test(adoption_table)
print(result_chi)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  adoption_table
#> X-squared = 40.457, df = 4, p-value = 3.481e-08

# Expected counts (rule of thumb: all should be >= 5 for the chi-squared approximation)
result_chi$expected
#>          Adoption
#> State     Improved Seeds Traditional   Mixed
#>   Punjab         43.7500      52.500 43.7500
#>   Haryana        32.8125      39.375 32.8125
#>   UP             23.4375      28.125 23.4375

# Standardised residuals (which cells drive the result?)
round(result_chi$stdres, 2)
#>          Adoption
#> State     Improved Seeds Traditional Mixed
#>   Punjab            0.30        1.75 -2.13
#>   Haryana          -0.72       -4.76  5.70
#>   UP                0.44        3.24 -3.83

7.5 Effect Sizes

Statistical significance tells you whether an effect exists; effect size tells you how large it is.

Code
library(effectsize)

# Cohen's d for t-tests
cohens_d(variety_A, variety_B)
#> Cohen's d |        95% CI
#> -------------------------
#> 0.11      | [-0.40, 0.62]
#> 
#> - Estimated using pooled SD.

# Cohen's rough benchmarks: small ~ 0.2, medium ~ 0.5, large ~ 0.8

# Cramér's V for chi-squared (strength of association in contingency tables)
cramers_v(adoption_table)
#> Cramer's V (adj.) |       95% CI
#> --------------------------------
#> 0.24              | [0.16, 1.00]
#> 
#> - One-sided CIs: upper bound fixed at [1.00].

7.6 Multiple Testing

When you run many tests, false positives accumulate. With 20 independent tests at α = 0.05 and all null hypotheses true, you expect one false positive on average.
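The arithmetic behind that claim, assuming independent tests with every null hypothesis true:

```r
m     <- 20      # number of tests
alpha <- 0.05    # per-test significance level

# Expected number of false positives
m * alpha                 # = 1

# Probability of AT LEAST one false positive (family-wise error rate)
1 - (1 - alpha)^m         # about 0.64

# With a Bonferroni-corrected threshold of alpha/m per test
1 - (1 - alpha/m)^m       # about 0.049: back near the nominal 0.05
```

So without correction there is roughly a two-in-three chance of at least one spurious "discovery" across 20 tests, which is what the adjustment methods below are designed to control.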

Code
# Raw p-values
p_values <- c(0.01, 0.04, 0.08, 0.15, 0.03, 0.22, 0.001)

# Bonferroni correction (conservative)
p.adjust(p_values, method = "bonferroni")
#> [1] 0.070 0.280 0.560 1.000 0.210 1.000 0.007

# Benjamini-Hochberg (less conservative, controls false discovery rate)
p.adjust(p_values, method = "BH")
#> [1] 0.035 0.070 0.112 0.175 0.070 0.220 0.007

7.7 Exercises

  1. The national average daily caloric intake is 2100 kcal. A survey of 45 households in a tribal district finds a mean of 1850 kcal (SD = 320). Test whether this district has significantly lower caloric intake than the national average. State H₀, H₁, and your conclusion clearly.

  2. Two groups of farmers are given different training programs. After 6 months, their yield improvements are measured. Test whether the programs differ in effectiveness. Check the assumption of equal variances first.

  3. Using airquality, test whether ozone levels are significantly different between May and August. First check whether the data are normally distributed (use shapiro.test()). Choose the appropriate test based on the result.

  4. Create a 3×2 contingency table of a categorical variable of your choice and test for independence using a chi-squared test. Compute Cramér’s V to interpret the strength of association.

  5. Challenge: Perform a simulation study to demonstrate the multiple testing problem. Generate 1000 datasets of 20 uncorrelated variables. Run pairwise t-tests. What proportion of results are “significant” at α=0.05? How does Bonferroni correction help?