
I'm encountering an unexpected discrepancy between the estimates from an unpaired t.test() and lm() in base R's stats package. While the independent-samples t-test yields exactly t = 0, and a manual calculation of the OLS slope also returns exactly 0, lm() gives only a near-zero approximation (specifically, -3.239e-14).

This pattern holds across machines and operating systems (macOS and Windows), as verified by my coworkers. Can anyone explain the discrepancy? The most similar post I could find does not seem to apply here.

# Populating an example dataframe
id <- rep(1, 16)
dose <- rep(c(rep("Low", 4), rep("High", 4)), 2)
domain <- c(rep("Domain1", 8), rep("Domain2", 8))
dv <- c(100, 0, 100, 100, 0, 100, 100, 100, 0, 0, 0, 0, 0, 100, 50, 0)

df <- data.frame(id, dose, domain, dv)
library(dplyr) ## needed for %>% and filter()
# Independent-samples t-test
t.test(dv ~ dose,
       data = filter(df, domain == "Domain1") %>% droplevels(), 
       paired = FALSE)

# data:  dv by dose
# t = 0, df = 6, p-value = 1
# Linear regression of factor IV and numeric DV 
summary(lm(as.numeric(dv) ~ as.factor(dose), 
           data = filter(df, domain == "Domain1")))

# Coefficients:
#                      Estimate Std. Error t value Pr(>|t|)  
# (Intercept)         7.500e+01  2.500e+01       3    0.024 *
# as.factor(dose)Low -3.239e-14  3.536e+01       0    1.000 
# Manual calculation of OLS slope
X <- model.matrix(dv ~ dose, data = filter(df, domain == "Domain1"))
y <- with(filter(df, domain == "Domain1"), dv)
R <- t(X) %*% X
solve(R) %*% t(X) %*% y

#             [,1]
# (Intercept)   75
# doseLow        0
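As a further cross-check (a sketch that rebuilds the same example data), the normal-equations solution can be compared against R's QR-based least-squares solver, which is closer to what lm() uses internally:

```r
# Rebuild the example data from above
dose <- rep(c(rep("Low", 4), rep("High", 4)), 2)
domain <- c(rep("Domain1", 8), rep("Domain2", 8))
dv <- c(100, 0, 100, 100, 0, 100, 100, 100, 0, 0, 0, 0, 0, 100, 50, 0)
df <- data.frame(dose, domain, dv)
d1 <- df[df$domain == "Domain1", ]

X <- model.matrix(dv ~ dose, data = d1)
y <- d1$dv

# Normal equations, as in the manual calculation above
beta_ne <- solve(t(X) %*% X) %*% t(X) %*% y

# QR-based least squares (qr.solve handles a rectangular X)
beta_qr <- qr.solve(X, y)

beta_ne  # slope comes out exactly 0 here
beta_qr  # any deviation from 0 is at the level of floating-point noise
```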
nathan liang
  • Different algorithms, different rounding, different precision. With 16,000 data points, `-3.239e-14` is close enough to 0 to be treated as 0. With 16 data points, even more so. – Gregor Thomas Sep 28 '22 at 18:52
  • Fair enough, thanks for the quick response! But is there a way to override the default behavior of the underlying algorithm to make the zero-value output display consistent across statistical tests? – nathan liang Sep 28 '22 at 18:57
  • 1
    You can use `round()` on the results. It's not the underlying algorithm you want to adjust, it's the print methods. [Here's a related question](https://stackoverflow.com/q/15732738/903061) (without a more satisfying answer). If you use the `broom` package and put the results in a tibble, they may print more to your liking. – Gregor Thomas Sep 28 '22 at 19:55

1 Answer


The t-statistic is exactly zero for t.test() (assign the results to tt and try tt$statistic == 0), while after assigning the results of summary() to ss we get

ss$coefficients["as.factor(dose)Low", "Estimate"]
[1] -3.014578e-14

(extracting the values and printing them avoids any rounding/formatting that might be done in the print() methods for these objects).

As discussed in the comments, this discrepancy is a more-or-less unavoidable consequence of floating-point imprecision, together with the fact that the two functions do mathematically equivalent calculations using very different numerical algorithms. (In hindsight it's not surprising that the t-test, which just subtracts the two means and scales the difference by the pooled standard error, gets a more precise answer than lm(), which does fancier linear algebra (a QR decomposition: see lm.fit()) to arrive at the same answer.)
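A practical corollary: when checking such estimates programmatically, compare against zero with a tolerance, e.g. via all.equal(), rather than with ==. A minimal self-contained sketch of the question's Domain1 fit:

```r
# Rebuild the question's data
dose <- rep(c(rep("Low", 4), rep("High", 4)), 2)
domain <- c(rep("Domain1", 8), rep("Domain2", 8))
dv <- c(100, 0, 100, 100, 0, 100, 100, 100, 0, 0, 0, 0, 0, 100, 50, 0)
df <- data.frame(dose, domain, dv)

fit <- lm(dv ~ factor(dose), data = df[df$domain == "Domain1", ])
slope <- unname(coef(fit)[2])

slope == 0                   # may be FALSE: the estimate is only *nearly* zero
isTRUE(all.equal(slope, 0))  # TRUE: equal up to numerical tolerance
```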

Unfortunately, it's not easy to adjust the printing precision low enough to make the coefficient print as zero: even

printCoefmat(coef(ss), digits = 1)

(a way to get just the summary table) still prints -3e-14.

A hack, using zapsmall() to round the coefficient values in such a way that coefficients that are small (in absolute value) relative to the largest coefficient get set to zero:

ss$coefficients[, "Estimate"] <- zapsmall(ss$coefficients[, "Estimate"])
printCoefmat(coef(ss))
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          75.000     25.000       3  0.02401 *
## as.factor(dose)Low    0.000     35.355       0  1.00000  

You could also do this upstream with the output of lm(), or by creating a wrapper function to use in place of lm():

zlm <- function(...) {
  ## fit with lm() as usual, then round near-zero coefficients to exactly zero
  m <- lm(...)
  m$coefficients <- zapsmall(m$coefficients)
  m
}
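A quick usage sketch of the wrapper idea (data and definition repeated so it runs standalone):

```r
zlm <- function(...) {
  ## fit with lm(), then zap near-zero coefficients to exactly zero
  m <- lm(...)
  m$coefficients <- zapsmall(m$coefficients)
  m
}

# Rebuild the question's data
dose <- rep(c(rep("Low", 4), rep("High", 4)), 2)
domain <- c(rep("Domain1", 8), rep("Domain2", 8))
dv <- c(100, 0, 100, 100, 0, 100, 100, 100, 0, 0, 0, 0, 0, 100, 50, 0)
df <- data.frame(dose, domain, dv)

fit <- zlm(dv ~ factor(dose), data = df[df$domain == "Domain1", ])
coef(fit)  # (Intercept) = 75, factor(dose)Low = 0 (exactly)
```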

Unfortunately there's not really any nice way to make things behave this way universally; it's just too hard to know exactly when it's safe/desirable to round small values to zero ...

Ben Bolker