Plotting binned regression estimates in R

I work a lot with binned regressions, i.e., regressions that estimate the effect of a variable that has been discretized into bins. This is common in applied microeconomics, where researchers often want to estimate the effect of a continuous variable (e.g., temperature, income, distance) on an outcome while allowing for nonlinearities by binning the variable. The canonical estimating equation here is something like:

$y_{i} = \sum_{b = 1}^{B} β_{b} 1 [x_{i} \in bin b] + ε_{i}$

where $y_{i}$ is the outcome, $x_{i}$ is the continuous independent variable that has been binned into $B$ bins, and $ε_{i}$ is the error term.

Anyway, I’ve found myself writing the same code over and over to estimate models using a binned independent variable and then to plot those estimates. To save my future self some time, this post documents a general workflow for estimating, extracting, and plotting coefficient estimates from binned models.

Setup

We’ll use the excellent fixest for estimation and the tidyverse for data manipulation and plotting.

pacman::p_load(tidyverse, fixest, cowplot)
theme_set(theme_cowplot())
set.seed(42)

Generate fake data

Let’s create some fake data where the outcome has a nonlinear relationship with temperature. I’ll generate data that shows “performance” (e.g., productivity or athletic performance) declining at high temperatures.

n <- 500

data <- tibble(
  id = 1:n,
  group = sample(LETTERS[1:10], n, replace = TRUE),
  temperature = rnorm(n, mean = 20, sd = 10),
  noise = rnorm(n, 0, 3)
) %>%
  mutate(
    # Create a nonlinear effect: small effect until temp > 15, then larger negative effect
    performance = 100 - 0.1 * pmax(temperature - 15, 0)^1.5 + noise,
    # Bin the temperature variable
    temp_bin = cut(temperature, breaks = c(-Inf, seq(0, 40, 5), Inf))
  )

glimpse(data)

Rows: 500
Columns: 6
$ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ group       <chr> "A", "E", "A", "I", "J", "D", "B", "J", "A", "H", "G", "D"…
$ temperature <dbl> 24.0143625, 4.9065749, 8.6163413, 23.3774120, 5.2737186, 1…
$ noise       <dbl> -4.2756753, -1.4613355, -0.8006003, 0.6995681, -1.2175681,…
$ performance <dbl> 93.01786, 98.53866, 99.19940, 98.27483, 98.78243, 101.8180…
$ temp_bin    <fct> "(20,25]", "(0,5]", "(5,10]", "(20,25]", "(5,10]", "(15,20…

Estimate the model

We’ll estimate a simple model with binned temperature and group fixed effects. The reference category is the bin containing temperatures around 15-20°C (where we expect performance to be close to baseline).

fit <- feols(performance ~ i(temp_bin, ref = "(15,20]") | group, data = data)

etable(fit)

                                           fit
Dependent Var.:                    performance
                                              
temp_bin = (-Inf,0]15,20]")    0.5785 (0.8412)
temp_bin = (0,5]15,20]")       0.4285 (0.6700)
temp_bin = (5,10]15,20]")      0.1270 (0.5590)
temp_bin = (10,15]15,20]")    0.8427. (0.5032)
temp_bin = (20,25]15,20]")  -2.253*** (0.4723)
temp_bin = (25,30]15,20]")  -4.087*** (0.5074)
temp_bin = (30,35]15,20]")  -6.915*** (0.5860)
temp_bin = (35,40]15,20]")  -10.46*** (0.7569)
temp_bin = (40,Inf]15,20]") -15.00*** (0.9396)
Fixed-Effects:              ------------------
group                                      Yes
___________________________ __________________
S.E. type                                  IID
Observations                               500
R2                                     0.58630
Within R2                              0.57832
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Extract and clean coefficients

Now we can extract the coefficients. We do that by saving the output of coefplot, which gives us a tidy data frame of estimates in the prms object. Note that calling coefplot yields a pretty good plot by default! So you may want to stop here if you’re just doing a quick visualization.

# Get coefficient estimates with confidence intervals
coef_df <- coefplot(fit, keep = "temp_bin")$prms

Assuming we do want to customize the plot further, we can now extract the bin limits and calculate midpoints for plotting. I wrote a helper function that parses the bin names and extracts the lower and upper bounds. It also (optionally) corrects infinite bounds to finite values based on the bin width.

Finally, the body of the code itself also includes the addition of the reference category (which has an estimate of 0 by definition) and arranges the data frame by bin midpoints.

extract_limits_from_bin_names <- function(x, fix_Inf = TRUE) {
  # Extract intervals from coefficient names like "temp_bin::(0,5]"
  intervals_chr <- str_extract(x, "\\(([^,]+),([^\\]]+)\\]")
  
  # Parse the lower and upper bounds
  intervals <- str_match(intervals_chr, "\\(([^,]+),([^\\]]+)\\]")
  
  intervals_df <- data.frame(
    bin_low = as.numeric(intervals[, 2]),
    bin_high = as.numeric(intervals[, 3])
  )
  
  # Replace infinite bounds with finite values based on bin width
  if (fix_Inf) {
    bin_size <- intervals_df$bin_high - intervals_df$bin_low
    bin_size <- min(bin_size[is.finite(bin_size)])
    
    intervals_df <- intervals_df %>%
      mutate(
        bin_low = if_else(is.infinite(bin_low), bin_high - bin_size, bin_low),
        bin_high = if_else(is.infinite(bin_high), bin_low + bin_size, bin_high)
      )
  }
  
  return(intervals_df)
}

# Extract bin limits and calculate midpoints
coef_df <- coef_df %>%
  bind_cols(extract_limits_from_bin_names(coef_df$estimate_names)) %>%
  mutate(bin_mid = (bin_low + bin_high) / 2)

# Add the reference category (estimate = 0)
ref_df <- tibble(
  estimate_names = "(15,20]",
  estimate = 0,
  ci_low = 0,
  ci_high = 0,
  bin_low = 15,
  bin_high = 20,
  bin_mid = 17.5
)

coef_df <- bind_rows(coef_df, ref_df) %>%
  arrange(bin_mid)

glimpse(coef_df)

Rows: 10
Columns: 11
$ estimate           <dbl> 0.5785088, 0.4284605, 0.1269886, 0.8427303, 0.00000…
$ ci_low             <dbl> -1.0742853, -0.8879347, -0.9714904, -0.1460390, 0.0…
$ ci_high            <dbl> 2.231303, 1.744856, 1.225468, 1.831500, 0.000000, -…
$ estimate_names     <chr> "temp_bin::(-Inf,0]15,20]\")", "temp_bin::(0,5]15,2…
$ estimate_names_raw <chr> "temp_bin::(-Inf,0]15,20]\")", "temp_bin::(0,5]15,2…
$ id                 <dbl> 1, 1, 1, 1, NA, 1, 1, 1, 1, 1
$ x                  <dbl> 1, 2, 3, 4, NA, 5, 6, 7, 8, 9
$ y                  <dbl> 0.5785088, 0.4284605, 0.1269886, 0.8427303, NA, -2.…
$ bin_low            <dbl> -5, 0, 5, 10, 15, 20, 25, 30, 35, 40
$ bin_high           <dbl> 0, 5, 10, 15, 20, 25, 30, 35, 40, 45
$ bin_mid            <dbl> -2.5, 2.5, 7.5, 12.5, 17.5, 22.5, 27.5, 32.5, 37.5,…

Plot the results

ggplot(coef_df, aes(x = bin_mid, y = estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_ribbon(aes(ymin = ci_low, ymax = ci_high), alpha = 0.2, fill = "steelblue") +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 2) +
  labs(
    x = "Temperature (°C)",
    y = "Change in performance relative to 15-20°C",
    title = "Effect of temperature on performance"
  ) +
  theme_cowplot()

The plot clearly shows the nonlinear effect we built into the data: performance is relatively stable at cooler temperatures but declines sharply once temperatures exceed about 20°C.

Wrapping up

This workflow is handy whenever you’re working with binned variables and want to visualize the estimates. The key steps are:

Create bins using cut()
Estimate the model with i() syntax in fixest
Extract coefficients using coefplot()
Parse bin limits from coefficient names
Add the reference category manually
Plot using ggplot2

The extract_limits_from_bin_names() function does the heavy lifting of parsing the interval notation, which saves you from manually typing out bin midpoints or limits.