I have billions of measurements for two values, x and y. This is too large to operate on the raw data, so I'm representing them as a frequency table. I have one row for each unique combination of x value and y value, and a variable freq showing how many data points had that combination of values.

If I want to estimate the relationship between x and y, I can do: lm(y ~ x, data=df, weights=df$freq). I've tested this and it gives accurate parameter estimates, but the wrong t value. It's still treating each row as one observation, so the degrees of freedom are much smaller than they should be.

  • Is there a way to run analyses that treat each row as the appropriate number of records?
  • Are there generalizable tools for having R operate on a frequency table as if it were a raw dataset?

Note: this question shows how to recreate the raw data, but my raw data is unmanageably large, which is why I'm using a frequency table in the first place.

example

# This dataset has a negative correlation between x and y:
library(dplyr) 
raw_data<-data.frame(
  x=rep(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4), 100),
  y=rep(c(5,5,5,5,1,4,4,4,4,1,3,3,3,3,7,2,2,2,2,8), 100)
)

lm_raw<-lm(x ~ y, data=raw_data)
summary(lm_raw)[c("coefficients", "df")]

# Let's say instead I have a summary dataset that has the frequency for each x-y pair:

freq_data <- raw_data %>% group_by(x,y) %>% summarise(freq=n())

# Analyze and weight by frequency. Parameter estimates are right but the t value is wrong:

lm_freq<-lm(x ~ y, data=freq_data, weights=freq)
summary(lm_freq)$coefficients

# ... because it's treating this as 8 data points instead of thousands

summary(lm_freq)$df
octern
  • Cannot run your code as some objects are missing. You can find more information about the weight parameter in `?lm`. – DJJ Jul 09 '20 at 21:01
  • The `weights` parameter in lm is NOT for case weights. The `survey` package might be a place to go since the documentation makes clear that there are multiple types of "weighting" available. – IRTFM Jul 09 '20 at 21:10
  • you can recover the parameter covariance matrix with `vcov(lm_freq) / (lm_raw$df.residual/lm_freq$df.residual)` (and you can calculate lm_raw$df.residual manually) – user2957945 Jul 09 '20 at 22:15

1 Answer

You can manually adjust the degrees of freedom:

lm_freq$df.residual <- with(lm_freq, sum(weights) - length(coefficients))

Now you should get the correct t-values. I referenced this article.
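
As a sanity check on this adjustment, here is a minimal sketch in base R that rebuilds the question's toy data, fits both models, and compares the coefficient tables. It assumes `freq` is an exact count of identical observations (a frequency weight, not a sampling or precision weight). The names `coef_raw`, `coef_freq`, and `matches` are only illustrative. `summary()` may warn that the residual degrees of freedom no longer look like a plain `lm` fit, so that call is wrapped in `suppressWarnings()`.

```r
# Toy data from the question: 2000 raw rows
raw_data <- data.frame(
  x = rep(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4), 100),
  y = rep(c(5,5,5,5,1,4,4,4,4,1,3,3,3,3,7,2,2,2,2,8), 100)
)

# One row per unique (x, y) pair, with its count in `freq`
freq_data <- aggregate(freq ~ x + y,
                       data = transform(raw_data, freq = 1),
                       FUN = sum)

lm_raw  <- lm(x ~ y, data = raw_data)
lm_freq <- lm(x ~ y, data = freq_data, weights = freq)

# Override the residual degrees of freedom:
# total number of underlying records minus number of coefficients
lm_freq$df.residual <- sum(freq_data$freq) - length(coef(lm_freq))

# Estimates, standard errors, t values and p values from each fit
coef_raw  <- summary(lm_raw)$coefficients
coef_freq <- suppressWarnings(summary(lm_freq)$coefficients)

# Compare the two tables up to floating-point tolerance
matches <- isTRUE(all.equal(coef_raw, coef_freq))
print(matches)
```

This only checks `summary()`. Other functions that read the fitted object, such as `anova()` or `predict()` with intervals, are not checked here, so it is safest to treat the override as a patch for the coefficient table rather than a general fix.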

  • I do not think you read that article correctly. I'm pretty sure it's more complicated than just adjusting the degrees of freedom. That article specifically says the inferential statistics are wrong if you use `lm`. Try reading the citation to a duplicate that I'm offering. – IRTFM Jul 09 '20 at 21:23