I'm trying to calculate weighted confidence intervals and need a weighted mean to do so. But I keep running into the same error, and I can't figure out how to solve it. The data come from the European Social Survey, and I have loaded the following libraries:

library(tidyverse)
library(haven) 
library(essurvey) 
library(radiant.data) 

The following code should output, among other things, confidence intervals:

ESS %>% # Use the ESS, then
  transmute( # Create new variables and only keep these new ones
    # Make the following variables factors:
    cntry = as_factor(cntry), 
    # Make the following variables numeric:
    pspwght = zap_labels(pspwght),
    hmsacld = max(zap_labels(hmsacld), na.rm = TRUE) - zap_labels(hmsacld), #Turning scale around
  ) %>%
  group_by(cntry) %>% # Group data by country, then
  summarize(
    n = sum(pspwght, na.rm = TRUE),
    mean_hmsacld = weighted.mean(hmsacld, pspwght, na.rm = TRUE), 
    sd_hmsacld = weighted.sd(hmsacld, pspwght), 
    se_hmsacld = sd_hmsacld / sqrt(n),
    min95 = mean_hmsacld - se_hmsacld * qt(p = 0.975, df = n),
    max95 = mean_hmsacld + se_hmsacld * qt(p = 0.975, df = n)
  )

Instead, I get the following error:

Error in weighted.mean.default(x, wt) : 
  'x' and 'w' must have the same length

Any idea how to fix this?

Thanks

SnupSnurre
  • Hi KasperA. Please read the info about [how to ask a good question](https://stackoverflow.com/help/how-to-ask) and how to give a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). That way you can help others to help you! – dario Feb 19 '20 at 15:58
  • How do you obtain the data frame ESS? – Pawel Stradowski Feb 19 '20 at 16:07
  • @PawelStradowski, I use the library and the following code: `ESS <- import_rounds(rounds = 8, ess_email = "ENTER EMAIL")` – SnupSnurre Feb 19 '20 at 20:05
  • One needs to register in order to retrieve the data, and I don't want to do this. Could you please post the result of `dput(head(ESS, 20))`? I am not sure if it will be enough for a reprex, but let's try this. – Pawel Stradowski Feb 19 '20 at 20:23
  • @PawelStradowski, Sure. Thank you for your effort! The output of the dput-function was simply too long to post here so [here is a Pastebin-link instead](https://pastebin.com/KwrNKaVd). – SnupSnurre Feb 20 '20 at 07:09

1 Answer

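You have NA values in the hmsacld column, which cause the error you observe. A weighted statistic needs exactly one weight in pspwght for each value of hmsacld. The error message says that the values and the weights end up with different lengths, which is what happens when the missing values are removed from hmsacld while all of the weights are kept. A simple experiment: let's drop all rows that contain an NA from ESS: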

library(tidyverse)
library(haven)
library(essurvey) 
library(radiant.data)

ESS <- structure(list(cntry = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("AT", 
"BE", "CH", "CZ", "DE", "EE", "ES", "FI", "FR", "GB", "HU", "IE", 
"IL", "IS", "IT", "LT", "NL", "NO", "PL", "PT", "RU", "SE", "SI"
), class = "factor"), pspwght = c(1.17849552631378, 0.899471521377563, 
0.31575334072113, 0.472467392683029, 2.24670553207397, 1.01137900352478, 
1.83802974224091, 1.20280182361603, 0.320830971002579, 0.99757444858551, 
0.550059616565704, 0.691191911697388, 0.411176264286041, 0.673080623149872, 
1.28033947944641, 0.647780179977417, 2.93387079238892, 0.374067783355713, 
0.696788847446442, 0.699867308139801), hmsacld = c(4, 4, 2, 3, 
4, 1, NA, 2, 2, 1, 4, 2, 3, 3, 0, 4, 3, 1, 3, 4)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -20L))


ESS %>% # Use the ESS, then
  transmute( # Create new variables and only keep these new ones
    # Make the following variables factors:
    cntry = as_factor(cntry), 
    # Make the following variables numeric:
    pspwght = zap_labels(pspwght),
    hmsacld = max(zap_labels(hmsacld), na.rm = TRUE) - zap_labels(hmsacld), #Turning scale around
  ) %>%
  drop_na() %>% 
  group_by(cntry) %>% # Group data by country, then
  summarize(
    n = sum(pspwght, na.rm = TRUE),
    mean_hmsacld = weighted.mean(hmsacld, pspwght, na.rm = TRUE), 
    sd_hmsacld = weighted.sd(hmsacld, pspwght), 
    se_hmsacld = sd_hmsacld / sqrt(n),
    min95 = mean_hmsacld - se_hmsacld * qt(p = 0.975, df = n),
    max95 = mean_hmsacld + se_hmsacld * qt(p = 0.975, df = n)
  )
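
As a cross-check that does not depend on radiant.data, the same idea (keep only the rows where both the value and its weight are present, then compute the weighted statistics) can be sketched in base R. The helper name `weighted_stats` and the toy vectors are hypothetical, and the standard deviation below uses normalized weights without a small-sample correction, which is an assumption and not necessarily what `weighted.sd` does internally.

```r
# Hypothetical helper: weighted mean and SD computed on complete
# (value, weight) pairs only, so both vectors always have the same length.
weighted_stats <- function(x, w) {
  keep <- !is.na(x) & !is.na(w)        # drop a pair if either element is missing
  x <- x[keep]
  w <- w[keep]
  w_norm <- w / sum(w)                 # normalize the weights to sum to 1
  m <- sum(w_norm * x)                 # weighted mean
  s <- sqrt(sum(w_norm * (x - m)^2))   # weighted SD, no bias correction (assumption)
  list(n_pairs = length(x), mean = m, sd = s)
}

# Toy data, made up for illustration: one missing value, all weights present.
x <- c(4, 2, NA, 1, 3)
w <- c(1.2, 0.9, 1.8, 1.0, 0.5)

res <- weighted_stats(x, w)
res$n_pairs   # 4 complete pairs remain
res$mean
res$sd
```

Filtering on a single logical index keeps each value aligned with its own weight, which is also what `drop_na()` achieves at the row level in the pipeline above.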

Regards Paweł

Pawel Stradowski
  • Thank you! It works perfectly now, and the explanation about the missing values makes sense. I have one clarifying question, though. We define hmsacld as follows: `hmsacld = max(zap_labels(hmsacld), na.rm = TRUE) - zap_labels(hmsacld)`. I thought `na.rm = TRUE` would remove the missing values. Why didn't it? – SnupSnurre Feb 21 '20 at 08:07
  • Look at the second term of your expression, `- zap_labels(hmsacld)`. The `na.rm = TRUE` argument applies only to the first term, the `max()` call, so any NA in the second term carries through the subtraction. – Pawel Stradowski Feb 21 '20 at 08:42
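
The point made in the last comment can be checked with a small base R sketch. The vector `x` is a made-up toy example, not ESS data.

```r
x <- c(4, 2, NA, 1)                    # toy vector, made up for illustration

top <- max(x, na.rm = TRUE)            # 4: the NA is ignored inside max() only
reversed <- top - x                    # subtraction is element-wise and keeps the NA

reversed                               # 0 2 NA 3
sum(is.na(reversed))                   # 1: the missing value is still there
```

Reversing the scale therefore leaves the missing values in place, and they have to be dropped or filtered separately before the weighted summaries are computed.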