I understand that lm treats weights as "analytic" weights, meaning that observations are just weighted against each other (e.g. lm will weigh an observation with weight = 2 twice as much as one with weight = 1), and the overall N for the model is unaffected. "Frequency" weights, on the other hand, would allow the model to have a different N than the actual number of observations in the data.
People have asked about frequency weights in R before, but as far as I can tell prior questions have been concerned with survey data. I am not using survey data for this question.
I'd like to implement frequency weights that are less than 1, and which cause the model's N to be smaller than the actual number of rows in the data. For example, if nrow(df) = 8 and all observations have weight = 0.5, the model N should be 4, and the standard errors should reflect this difference. The weights for base R's lm can't be used this way, as far as I can tell:
library(tidyverse)
library(broom)
df.unweighted <- tribble(
  ~x, ~y, ~w,
   0, 10,  1,
   0, 20,  1,
   1, 40,  1,
   1, 50,  1,
) %>%
  bind_rows(., .) # make twice as large
df.weighted <- df.unweighted %>%
  mutate(w = 0.5)
lm(data = df.unweighted, y ~ x, weights = w) %>%
  tidy()
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)      15.      2.89      5.20 0.00202
#> 2 x                30       4.08      7.35 0.000325
lm(data = df.weighted, y ~ x, weights = w) %>%
  tidy()
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)      15.      2.89      5.20 0.00202
#> 2 x                30.0     4.08      7.35 0.000325
# identical
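A quick way to see why the two fits agree is to look at what lm counts as its N. The sketch below uses base R only and rebuilds the same toy data in a plain data frame (the name `d` is just for this illustration). It shows that the residual degrees of freedom come from the number of rows, not from the sum of the weights.

```r
# Same toy data as df.weighted, built with base R only.
d <- data.frame(
  x = rep(c(0, 0, 1, 1), times = 2),
  y = rep(c(10, 20, 40, 50), times = 2),
  w = 0.5
)

fit <- lm(y ~ x, data = d, weights = w)

nobs(fit)         # 8: rows with non-zero weight
df.residual(fit)  # 6: rows minus number of coefficients
sum(d$w)          # 4: the N a frequency-weight reading would imply
```

Because a constant rescaling of the weights rescales the weighted residual sum of squares and the weighted cross-product matrix by the same factor, the two effects cancel and the standard errors do not move.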
What I'm looking for can be achieved in Stata using iweights. Note the model N and standard errors:
library(RStata)
stata("reg y x [iweight=w]",
data.in = df.weighted)
#> . reg y x [iweight=w]
#>
#>       Source |       SS       df       MS              Number of obs =       4
#> -------------+------------------------------           F(  1,     2) =   18.00
#>        Model |         900     1         900           Prob > F      =  0.0513
#>     Residual |         100     2          50           R-squared     =  0.9000
#> -------------+------------------------------           Adj R-squared =  0.8500
#>        Total |        1000     3  333.333333           Root MSE      =  7.0711
#>
#> ------------------------------------------------------------------------------
#>            y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#>            x |         30   7.071068     4.24   0.051    -.4243492    60.42435
#>        _cons |         15          5     3.00   0.095    -6.513264    36.51326
#> ------------------------------------------------------------------------------
In my actual usage, not all observations will have the same weight. I just did that here for ease of demonstration.
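For reference, here is a minimal base-R sketch of the calculation being asked for. It is one possible approach, not an established lm feature. It keeps lm's coefficient estimates but recomputes the residual variance with sum(w) in place of the row count. The helper name `freq_weight_summary` is hypothetical, and the sketch assumes non-negative weights whose sum exceeds the number of coefficients. On the toy data it reproduces the Stata standard errors shown above; whether it matches Stata's iweight handling in every case is an assumption that has not been checked here.

```r
# Toy data: 8 rows, every weight = 0.5 (base R version of df.weighted).
d <- data.frame(
  x = rep(c(0, 0, 1, 1), times = 2),
  y = rep(c(10, 20, 40, 50), times = 2),
  w = 0.5
)

fit <- lm(y ~ x, data = d, weights = w)

# Hypothetical helper: reinterpret lm's weights as frequency weights.
freq_weight_summary <- function(fit) {
  w     <- weights(fit)              # weights passed to lm
  p     <- fit$rank                  # number of estimated coefficients
  n_eff <- sum(w)                    # "frequency" N
  df    <- n_eff - p                 # residual df based on sum of weights
  rss   <- sum(w * residuals(fit)^2) # weighted residual sum of squares
  sigma2 <- rss / df
  # summary.lm stores (X'WX)^-1 as cov.unscaled
  se  <- sqrt(diag(summary(fit)$cov.unscaled) * sigma2)
  est <- coef(fit)
  tval <- est / se
  data.frame(
    term      = names(est),
    estimate  = unname(est),
    std.error = unname(se),
    statistic = unname(tval),
    p.value   = unname(2 * pt(-abs(tval), df = df)),
    n         = n_eff
  )
}

res <- freq_weight_summary(fit)
print(res)
```

Nothing in the helper depends on the weights being equal, so it can be tried with varying weights as well, but results for that case should be compared against Stata before being relied on.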