
In this example, R's `lm` function finds a very small, nonzero coefficient between two columns that ought to be systematically uncorrelated (and this only happens when predicting in one direction, not the other). Is it a rounding error? It becomes a real problem with `lm.cluster`, which turns this rounding error into a nearly significant effect.

## Why does this happen? ##

```r
library(miceadds)

id <- c(1, 1, 1, 1, 2, 2, 2, 2)
a  <- c(5, 5, 5, 5, 1, 1, 1, 1)
b  <- c(-0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5)

df <- data.frame(id, a, b)
df

reg <- lm(b ~ a, data = df)
## no correlation
summary(reg)
# Call:
# lm(formula = b ~ a, data = df)
# Residuals:
#    Min     1Q Median     3Q    Max 
#   -0.5   -0.5    0.0    0.5    0.5 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) -7.494e-17  3.680e-01       0        1
# a            2.475e-17  1.021e-01       0        1
# Residual standard error: 0.5774 on 6 degrees of freedom
# Multiple R-squared:  2.696e-32,   Adjusted R-squared:  -0.1667 
# F-statistic: 1.618e-31 on 1 and 6 DF,  p-value: 1


reg <- lm(a ~ b, data = df)
## minuscule correlation
summary(reg)
# Call:
# lm(formula = a ~ b, data = df)
# Residuals:
#    Min     1Q Median     3Q    Max 
#     -2     -2      0      2      2 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)  
# (Intercept) 3.000e+00  8.165e-01   3.674   0.0104 *
# b           2.183e-16  1.633e+00   0.000   1.0000  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 2.309 on 6 degrees of freedom
# Multiple R-squared:  9.861e-32,   Adjusted R-squared:  -0.1667 
# F-statistic: 5.916e-31 on 1 and 6 DF,  p-value: 1

cluster_reg <- lm.cluster(a ~ b, data = df, cluster = "id")
summary(cluster_reg) ## nearly significant effect?!
```

The coefficient ought to be exactly 0 in all three regressions, but for me the second and the clustered regression yield coefficients of 6.28e-16. Is this error unique to my machine? What could cause it, and how can I analyze data with this structure in a way that avoids the issue?
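For what it's worth, one way to see that such coefficients are numerically zero is to compare them against a floating-point tolerance rather than against exact 0. This is only a sketch using the toy data above; `all.equal()`, `zapsmall()`, and `.Machine$double.eps` are base R:

```r
# Same toy data as above
id <- c(1, 1, 1, 1, 2, 2, 2, 2)
a  <- c(5, 5, 5, 5, 1, 1, 1, 1)
b  <- c(-0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5)
df <- data.frame(id, a, b)

fit   <- lm(a ~ b, data = df)
slope <- unname(coef(fit)["b"])         # on the order of 1e-16, not exactly 0

# Zero within machine tolerance?
isTRUE(all.equal(slope, 0))             # TRUE

# zapsmall() rounds away values that are tiny relative to the largest element:
zapsmall(coef(fit))                     # the b coefficient rounds to exactly 0

# Or compare against an explicit numerical tolerance:
abs(slope) < sqrt(.Machine$double.eps)  # TRUE
```

Any downstream workflow that treats "not exactly 0" as meaningful should apply a check like this first.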

  • 6.28e-16 is for all intents and purposes 0. It's 0.000000000000000628. Note that computers aren't actually great with numbers so sometimes rounding errors will occur. In what instance is that a problem exactly? What is the output you see? When I run your `summary(cluster_reg)` code I see p-values of 0.7150142 and 0.1649148, both of which aren't very significant and it prints out "R^2= 0" – MrFlick Jul 12 '23 at 18:40
  • I'm doing something similar for a larger dataset (for an academic paper), and something like this is included as part of a regression examining an interaction effect. When many more observations are included, the p-value for the main effect of this obviously systematically uncorrelated variable is less than .0001. – Carter Allen Jul 12 '23 at 18:48
  • If you've got a workflow that takes an effect on the order of 10^-16 and gives a p-value of 0.0001, I would look at the workflow as the issue, not the 10^-16. On my computer, the smallest positive number `x` such that `1 + x != 1` is `.Machine$double.eps`, which is `2.220446e-16`. If your procedure labels anything not exactly equal to 0 as highly significant, that's a problem with the procedure. – Gregor Thomas Jul 12 '23 at 18:52
  • Uh, ok -- @gregor-thomas, in this case, re: the last part of my question, what's a better way to analyze data like this (while clustering effects by participant) that would accurately determine whether there's an effect between these variables? I don't know of a more intuitive way. – Carter Allen Jul 12 '23 at 19:04
  • If you want advice on how to analyze your data, you should ask for help at [stats.se] instead; that's not a specific programming question appropriate for Stack Overflow. The answer really depends on your data and the modeling assumptions that would be appropriate for it. – MrFlick Jul 12 '23 at 19:30
  • This is a variant of [R FAQ 7.31](https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f) or [Why are these numbers not equal?](https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal) – Rui Barradas Jul 12 '23 at 19:30
  • I agree with @GregorThomas and suspect this is an [XY problem](https://en.wikipedia.org/wiki/XY_problem). Also, bear in mind that with a large enough sample size ("a larger dataset") literally _any_ non-zero observed effect will be reported as statistically significant, even when the true value of the effect is known to be zero. But "statistically significant" is not the same as "practically relevant". – Limey Jul 13 '23 at 07:15

0 Answers