• Blog
  • Teaching
    • Jacob’s Teaching Experience
    • Introduction to R
    • Introduction to Git/GitHub
    • API 222 Section Material
  • Research

On this page

  • Introduction
  • 1. Simulate Some Data
    • Quick Plot of the Data
  • 2. Frequentist OLS for Reference
  • 3. Bayesian Linear Regression
    • 3.1 What Are Priors Here?
    • 3.2 Posterior Summaries
  • 4. Explaining the Posterior More Explicitly
    • 4.1 Combining Prior and Likelihood
  • Posterior Predictive Distribution: Why It Matters
    • The Bayesian Approach: Uncertainty in Predictions
    • Simulating from the Posterior Predictive Distribution
  • Key Takeaways: Why This Matters

A Gentle Introduction to Bayesian Linear Regression in R

bayesian
regression
This guide walks you through Bayesian linear regression in R and Stan, explaining how priors, likelihoods, and posteriors work in an intuitive and hands-on way.
Author

Jacob Jameson

Published

February 9, 2025

Introduction

Linear regression is a fundamental tool for modeling relationships between an outcome variable and one or more predictors. Traditionally, we often do this via ordinary least squares (OLS). In the frequentist approach, we estimate regression coefficients by finding values that best explain the observed data. The uncertainty in these estimates is represented by confidence intervals (CIs), which describe the expected variation if we were to repeat the entire data-collection process many times.

However, confidence intervals are often misinterpreted. Many people assume a 95% confidence interval means, “There is a 95% probability that the true parameter lies within this interval.” In reality, this is incorrect—CIs do not assign probabilities to parameters. Instead, they describe how often the interval would contain the true parameter across many repeated samples, which is not the way we typically think about uncertainty.

In contrast, Bayesian credible intervals behave the way we actually want these things to work. A 95% credible interval does mean that, given our observed data and model, “There is a 95% probability that the true parameter lies within this range.” This makes Bayesian inference particularly appealing when we want intuitive probability statements about our parameters.

Bayesian linear regression treats model parameters (e.g., intercept, slope) as random variables with their own probability distributions. Instead of producing a single “best” estimate for each coefficient, we derive a posterior distribution, which represents a range of possible values for each parameter, given both:

  • Our prior beliefs (before seeing the data)

  • The likelihood of the observed data under the model assumptions

The result is a credible interval, which gives us a direct probability statement about our parameters, making it a powerful tool in statistical modeling.

In this post, we’ll work through an example in R to see:

  • How the posterior distribution updates from the prior distribution.

  • How to visualize priors versus posteriors.

  • How to interpret Bayesian predictions (including posterior predictive distributions).


1. Simulate Some Data

Let’s simulate a small dataset with a linear relationship. Suppose the “true” model is:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

where:

  • \(\beta_0 = 2.0\) (intercept),

  • \(\beta_1 = 3.5\) (slope),

  • \(\varepsilon \sim \text{Normal}(0,1)\) (noise).

library(ggplot2)
library(dplyr)
library(rstanarm)  # for Bayesian regression

set.seed(42)

n <- 50
x <- seq(0, 5, length.out = n)
beta0_true <- 2.0
beta1_true <- 3.5
sigma_true <- 1.0

# Generate y
noise <- rnorm(n, mean = 0, sd = sigma_true)
y <- beta0_true + beta1_true * x + noise

df <- data.frame(x, y)
head(df)
          x        y
1 0.0000000 3.370958
2 0.1020408 1.792445
3 0.2040816 3.077414
4 0.3061224 3.704291
5 0.4081633 3.832840
6 0.5102041 3.679590

Quick Plot of the Data

ggplot(df, aes(x=x, y=y)) +
  geom_point(color = "#008Fd5", alpha = 0.9, size=2) +
  geom_abline(intercept=beta0_true, slope=beta1_true, 
              color="red", linetype="dashed") +
  theme_minimal(base_size = 15) +
  labs(title="Simulated Data with True Regression Line",
       subtitle="Red dashed line = true relationship")


2. Frequentist OLS for Reference

First, we fit a simple OLS model:

ols_model <- lm(y ~ x, data = df)
summary(ols_model)

Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7403 -0.4366 -0.1193  0.8319  2.1072 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.3548     0.3175   7.416  1.7e-09 ***
x             3.3438     0.1094  30.555  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.139 on 48 degrees of freedom
Multiple R-squared:  0.9511,    Adjusted R-squared:  0.9501 
F-statistic: 933.6 on 1 and 48 DF,  p-value: < 2.2e-16

We get a point estimate for the intercept and slope plus confidence intervals.


3. Bayesian Linear Regression

Now let’s do a Bayesian version via the rstanarm package. Under the hood, stan_glm uses weakly informative priors by default (you can customize these).

bayes_model <- stan_glm(y ~ x, data=df, 
                        chains=2, iter=2000, seed=42)

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.024376 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 243.76 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.026 seconds (Warm-up)
Chain 1:                0.032 seconds (Sampling)
Chain 1:                0.058 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 3e-06 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.04 seconds (Warm-up)
Chain 2:                0.023 seconds (Sampling)
Chain 2:                0.063 seconds (Total)
Chain 2: 
print(bayes_model, digits=3)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ x
 observations: 50
 predictors:   2
------
            Median MAD_SD
(Intercept) 2.372  0.334 
x           3.337  0.115 

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.156  0.117 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The summary output gives us:

  • Posterior mean: The most likely value for each coefficient given the data.

  • Standard deviation: How uncertain we are about each estimate.

  • 95% credible interval: The range where the true coefficient likely falls with 95% probability.

Unlike frequentist confidence intervals, which describe the long-run variability across repeated samples, Bayesian credible intervals provide a direct probability statement about our parameters.

3.1 What Are Priors Here?

By default, stan_glm(..., family = gaussian()) uses something akin to a weakly informative prior on the slope and intercept. This means the prior allows a wide range of possible values for the coefficients but discourages extremely large magnitudes. You can supply arguments like prior, prior_intercept, etc., or switch to brms for more flexible syntax.

3.2 Posterior Summaries

The printed output typically gives us:

  • mean: Posterior mean of the parameter.

  • sd: Posterior standard deviation (akin to “uncertainty”).

  • 2.5% / 97.5%: Bounds of the 95% credible interval.

We can visualize these distributions:

plot(bayes_model, 
     plotfun = "areas",  
     pars = c("(Intercept)", "x"),  
     include = TRUE, 
     prob = 0.95,  border = "black") + 
  theme_minimal(base_size = 15) + 
  labs(title = "Posterior Distributions with 95% Credible Interval",
       x = "Parameter Value",
       y = "Density")

Each density plot is the posterior distribution for one parameter, showing the entire spread of plausible values given the model and data. The shaded areas give the 95% credible interval.


4. Explaining the Posterior More Explicitly

4.1 Combining Prior and Likelihood

From a conceptual standpoint, Bayesian regression says:

\[ \text{Posterior}(\beta_0, \beta_1 \mid \text{data}) \;\propto\; \text{Prior}(\beta_0, \beta_1) \;\times\; \text{Likelihood}(\text{data} \mid \beta_0, \beta_1). \]

  1. Prior: What values of \(\beta_0, \beta_1\) are plausible before seeing any data?

  2. Likelihood: Given a candidate pair \((\beta_0, \beta_1)\), how well does it explain our observed \(y\) values?

  3. Posterior: The result of multiplying these together and renormalizing into a probability distribution.

The best way to see this in action is to do a “prior vs. posterior” plot for (say) the slope. Let’s do a quick example of customizing a prior so we can illustrate how it gets updated.

4.1.1 A Very Simple Custom Prior Example

Suppose we suspect the slope is likely around 1.0, with a standard deviation of 2. That means slopes near 1 are more plausible, but we allow for a broad range. Likewise, for the intercept, maybe we suspect a prior mean of 0 with a standard deviation of 10.

custom_bayes_model <- stan_glm(
  y ~ x,
  data = df,
  prior = normal(location = 1, scale = 2, autoscale=FALSE),       # slope prior
  prior_intercept = normal(location = 0, scale = 10, autoscale=FALSE),
  chains=2, iter=2000, seed=123
)

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.023 seconds (Warm-up)
Chain 1:                0.036 seconds (Sampling)
Chain 1:                0.059 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 5e-06 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.022 seconds (Warm-up)
Chain 2:                0.037 seconds (Sampling)
Chain 2:                0.059 seconds (Total)
Chain 2: 
print(custom_bayes_model, digits=3)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ x
 observations: 50
 predictors:   2
------
            Median MAD_SD
(Intercept) 2.378  0.344 
x           3.335  0.117 

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.156  0.116 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Now, let’s extract the prior and posterior draws and plot them. We can sample from the prior if we specify prior_PD = TRUE (prior predictive distribution).

# Step 1: sample from prior only (no data used)
prior_only_model <- stan_glm(
  y ~ x,
  data = df,
  prior = normal(location = 1, scale = 2, autoscale=FALSE),
  prior_intercept = normal(location = 0, scale = 10, autoscale=FALSE),
  chains=2, iter=2000, seed=123,
  prior_PD = TRUE   # <--- This means: ignore the likelihood of data
)

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 8e-06 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.024 seconds (Warm-up)
Chain 1:                0.025 seconds (Sampling)
Chain 1:                0.049 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 4e-06 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.04 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.013 seconds (Warm-up)
Chain 2:                0.041 seconds (Sampling)
Chain 2:                0.054 seconds (Total)
Chain 2: 
# Extract draws
prior_draws <- as.matrix(prior_only_model, pars=c("(Intercept)","x"))
post_draws <- as.matrix(custom_bayes_model, pars=c("(Intercept)","x"))
library(ggplot2)
library(tidyr)

df_draws <- data.frame(
  prior_slope  = prior_draws[,"x"],
  post_slope   = post_draws[,"x"]
)

df_long <- df_draws %>% 
  pivot_longer(cols = everything(), 
               names_to = "distribution", values_to="slope_value") %>%
  mutate(distribution = ifelse(
    distribution == "prior_slope", "Prior Distribution", 
    "Posterior Distribution"))

ggplot(df_long, aes(x=slope_value, fill=distribution)) +
  geom_density(alpha=0.4) +
  scale_fill_manual(values=c("Prior Distribution" = "#D55E00", 
                             "Posterior Distribution" = "#0072B2")) +
  labs(title="Slope: Prior vs Posterior",
       x="Slope Value", 
       fill = "Distribution") +
  theme_minimal(base_size = 14)

  • The red or orange curve will show the prior distribution for the slope (centered near 1, quite wide).

  • The blue curve will show the posterior after seeing the data. Because our true slope is actually 3.5, you’ll see that the posterior is pulled far to the right of the prior’s center of 1. The data “pushes” the posterior to align with the actual effect in the data, reducing uncertainty.

Posterior Predictive Distribution: Why It Matters

So far, we’ve focused on estimating the posterior distribution of our regression parameters. However, in most real-world applications, we’re not just interested in estimating coefficients—we want to use our model to make predictions about future observations.

In a frequentist regression, we typically obtain a point prediction:

\[ \hat{y}_* = \hat{\beta}_0 + \hat{\beta}_1 x_* \]

along with a prediction interval that accounts for both residual variability and parameter uncertainty.

The Bayesian Approach: Uncertainty in Predictions

In Bayesian regression, we don’t just compute a single best guess for \(y_*\)—we generate an entire posterior predictive distribution, which tells us:

  1. The most likely values of \(y_*\) based on our model.

  2. The full range of plausible values, incorporating both parameter uncertainty and residual variability.

  3. The probability of exceeding a critical threshold, which is extremely useful for decision-making.

Example Use Case:

  • Suppose \(x_* = 2.5\) represents the duration of a hospital stay, and \(y_*\) represents the total cost of treatment (in thousands).

  • A hospital administrator might ask:

    • “What is the probability that this patient’s costs will exceed $10,000?”

    • “How much uncertainty is there in our cost estimate?”

  • Instead of a single prediction, the Bayesian posterior predictive distribution provides a full range of possible outcomes, making it much more informative.

Simulating from the Posterior Predictive Distribution

We generate predictions for a new observation \(x_*\) by sampling from the posterior:

\[ y_* \sim \text{Normal}(\beta_0 + \beta_1 x_*, \sigma^2) \]

where \(\beta_0, \beta_1, \sigma\) are drawn from their posterior distributions.

Let’s compute this for \(x_* = 2.5\) and visualize the range of plausible values for \(y_*\).

# Posterior draws for intercept, slope, and sigma
post_draws_full <- as.matrix(custom_bayes_model)

# Function to simulate predictions from posterior
simulate_posterior_y <- function(x_star, n_sims=4000) {
  idx_int <- which(colnames(post_draws_full)=="(Intercept)")
  idx_slope <- which(colnames(post_draws_full)=="x")
  idx_sigma <- which(colnames(post_draws_full)=="sigma")

  intercept_samples <- post_draws_full[, idx_int]
  slope_samples     <- post_draws_full[, idx_slope]
  sigma_samples     <- post_draws_full[, idx_sigma]

  # Compute predicted means, then sample from Normal(mean, sigma)
  mu_star <- intercept_samples + slope_samples * x_star
  y_sim <- rnorm(n_sims, mean=mu_star, sd=sigma_samples)
  y_sim
}

x_star <- 2.5
y_sim <- simulate_posterior_y(x_star)

# Probability that y_* > 10
mean(y_sim > 10)
[1] 0.72675

A histogram of simulated outcomes shows the full range of possible \(y_*\) values, allowing us to make probability-based decisions.

library(ggplot2)

ggplot(data.frame(y_sim), aes(y_sim)) +
  geom_histogram(bins=30, fill="#D55E00", color="white") +
  theme_minimal(base_size = 14) +
  geom_vline(xintercept=10, color="black", linetype="solid", size=1.5) +
  labs(title="Posterior Predictive Distribution at x=2.5",
       subtitle="Black line = threshold of 10",
       x="Possible y* values", y="Frequency")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

In this case, the probability that \(y_*\) exceeds $10,000 is approximately 73%. This information is crucial for making informed decisions about resource allocation, risk management, or policy interventions.

Key Takeaways: Why This Matters

Unlike traditional frequentist regression, Bayesian posterior predictive distributions allow us to answer probabilistic questions about new data points:

  • Point Estimate + Full Uncertainty: Instead of a single predicted \(y_*\), we get a distribution over plausible values.

  • Probability-Based Decisions: We can compute the probability that an outcome exceeds (or falls below) a critical threshold.

  • More Robust Uncertainty Quantification: Since we account for both parameter uncertainty and residual variance, our predictions are more realistic.

Further Reading:

  • Introduction to Bayesian Statistics by William Bolstad.

  • Bayesian Data Analysis by Gelman et al. 

  • The rstanarm, brms, or bayesplot packages in R for specifying, fitting, and visualizing Bayesian models.

Copyright 2024, Jacob Jameson