I have billions of measurements for two variables, x and y. That's too large to operate on as raw data, so I'm representing it as a frequency table: one row for each unique combination of x value and y value, plus a variable freq showing how many data points had that combination of values.
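Schematically, the table looks something like this (made-up values and counts, purely to illustrate the structure):

x    y    freq
1.2  5.0  10483
1.2  7.5   8924
3.4  5.0  21170
...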
If I want to estimate the relationship between x and y, I can do lm(y ~ x, data=df, weights=df$freq). I've tested this and it gives accurate parameter estimates, but the wrong t value: it still treats each row as one observation, so the degrees of freedom are much smaller than they should be.
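To make the mismatch concrete (reusing the hypothetical df and weighted call from above):

fit <- lm(y ~ x, data=df, weights=df$freq)
fit$df.residual    # nrow(df) - 2: one residual degree of freedom per table row
sum(df$freq) - 2   # what the residual degrees of freedom should be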
- Is there a way to run analyses that treat each row as the appropriate number of records?
- Are there generalizable tools for having R operate on a frequency table as if it were a raw dataset?
note: this question shows how to recreate the raw data, but my raw data is unmanageably large, which is why I'm using a frequency table in the first place.
Example
# This dataset has a negative correlation between x and y:
library(dplyr)
raw_data <- data.frame(
  x = rep(c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4), 100),
  y = rep(c(5,5,5,5,1, 4,4,4,4,1, 3,3,3,3,7, 2,2,2,2,8), 100)
)
lm_raw <- lm(x ~ y, data=raw_data)
summary(lm_raw)[c("coefficients", "df")]
# Let's say instead I have a summary dataset with the frequency for each x-y pair:
freq_data <- raw_data %>% group_by(x, y) %>% summarise(freq = n())
# Analyze and weight by frequency. Parameter estimates are right but the t value is wrong:
lm_freq <- lm(x ~ y, data=freq_data, weights=freq_data$freq)
summary(lm_freq)$coefficients
# ... because it's treating this as 8 data points instead of thousands
summary(lm_freq)$df
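For reference, the closest thing to a manual fix I can see is rescaling the standard errors to the true residual degrees of freedom: the weighted residual sum of squares from lm_freq should equal the RSS of the raw-data fit, so only the denominator of the variance estimate is off. A sketch, which on this example should reproduce summary(lm_raw)$coefficients:

# Manual correction (sketch): rescale the variance estimate to the true df
n_true  <- sum(freq_data$freq)       # 2000 underlying observations
p       <- length(coef(lm_freq))     # 2 coefficients (intercept + slope)
df_true <- n_true - p

coefs   <- summary(lm_freq)$coefficients
se_true <- coefs[, "Std. Error"] * sqrt(lm_freq$df.residual / df_true)
t_true  <- coefs[, "Estimate"] / se_true
p_true  <- 2 * pt(abs(t_true), df = df_true, lower.tail = FALSE)
cbind(Estimate = coefs[, "Estimate"], `Std. Error` = se_true,
      `t value` = t_true, `Pr(>|t|)` = p_true)

But I'd much rather have a general-purpose tool than hand-rolled corrections like this.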