I have data such as this. My actual data contains over 500 variables and 2000 rows. Most of the variables are numeric.

library(survey)
library(dplyr)
library(readr)  # provides read_table2()

data_in <- read_table2("Q62_1   Q62_2   Q62_3   Q62_4   Q62_5   Q62_6   Q62_8   Q62_9   Q62_11  strata_num  fpl_num ID  wgt_part2B
    0   0   0   0   0   0   1   0   NA  28  1024    1   13.23574543
    NA  NA  NA  NA  NA  NA  NA  NA  1   56  1024    2   2.116895199
    1   0   0   1   0   0   1   1   NA  53  1024    3   3.570008516
    NA  NA  NA  NA  NA  NA  NA  NA  1   55  175 4   2.136456013
    NA  NA  NA  NA  NA  NA  NA  NA  1   65  1024    5   3.126420259
    NA  NA  NA  NA  NA  NA  NA  NA  1   48  1024    6   22.76417923
    0   0   0   1   0   0   1   0   NA  57  1024    7   41.29535294
    1   0   0   1   0   0   0   1   NA  50  1024    8   3.343874216
    0   1   0   0   1   0   1   0   NA  63  1024    9   4.042140961
    0   0   1   0   0   1   0   0   NA  66  175 10  2.071694136
    0   0   0   0   0   0   0   1   NA  3   1024    11  33.75452805
    1   1   1   1   1   1   1   1   NA  53  1024    12  3.676005363
    NA  NA  NA  NA  NA  NA  NA  NA  1   50  1024    13  1.816867232
    NA  NA  NA  NA  NA  NA  NA  NA  1   31  1024    14  7.386627674
    1   1   0   1   1   0   1   1   NA  43  1024    15  41.09143829
    1   0   0   0   0   0   0   0   NA  22  1024    16  2.053463221
    NA  NA  NA  NA  NA  NA  NA  NA  1   46  1024    17  2.977662086
    NA  NA  NA  NA  NA  NA  NA  NA  1   10  175 18  1.600314736
    1   1   0   1   0   0   0   0   NA  5   1024    19  11.9602499
    NA  NA  NA  NA  NA  NA  NA  NA  1   39  1024    20  2.177173615
    0   0   0   0   0   0   1   1   NA  17  1024    21  28.22195816
    NA  NA  NA  NA  NA  NA  NA  NA  NA  47  1024    22  1.565697789
    NA  NA  NA  NA  NA  NA  NA  NA  NA  65  1024    23  1.679090261
    0   0   1   0   0   0   1   0   NA  40  175 24  1.735284925
    0   0   0   0   1   0   1   1   NA  53  1024    25  1.60990274
    NA  NA  NA  NA  NA  NA  NA  NA  1   26  1024    26  1.949402809
    NA  NA  NA  NA  NA  NA  NA  NA  1   56  175 27  1.851846814
    1   0   0   0   1   0   1   1   NA  37  1024    28  16.71925735
    0   0   0   0   0   0   0   1   NA  63  1024    29  4.269656658
    NA  NA  NA  NA  NA  NA  NA  NA  NA  27  1024    30  1.471351266
    0   0   0   0   0   1   0   1   NA  70  1024    31  1.714126825
    1   1   0   1   1   0   1   0   NA  48  1024    32  4.113308907
    0   0   1   1   1   0   1   1   NA  44  175 33  2.039677382
    0   0   0   0   1   0   1   0   NA  32  1024    34  1.909546375
    ")

I set up the survey design like this:

SurveyDesign <- svydesign(id =~ID,
                          strata =~strata_num,
                          weights  = ~wgt_part2B, 
                          fpc =~fpl_num,
                          data = data_in)

I ran svymean on all the variables

svymean(reformulate(names(data_in)),SurveyDesign,na.rm=TRUE)

For some reason, all the means show as zero. When I run svymean on SOME of the variables, the mean shows up just fine.

Here is an example of svymean working with one of the variables

data_in2 <- data_in %>% select(matches("Q62_11|strata_num|fpl_num|ID|wgt_part2B"))

SurveyDesign <- svydesign(id =~ID,
                          # strata =~strata_num,
                          weights  = ~wgt_part2B, 
                          # fpc =~fpl_num,
                          data = data_in2)


svymean(reformulate(names(data_in2)),SurveyDesign,na.rm=TRUE)

Any suggestions??

NewBee
  • @Ronak Shah Any ideas why this might be happening? – NewBee Oct 22 '20 at 15:43
  • Looks like you have NA present in every row? I don't use svymean but does it only consider complete cases? – Dason Oct 22 '20 at 15:47
  • @Dason I think that na.rm should deal with that... additionally, svymean works on Q62_11 even though there are many NAs. – NewBee Oct 22 '20 at 15:56
  • I think you misunderstood what I meant. There are NAs in every row. So if what it's doing is only looking at rows that are complete you won't have any data. Like I said I don't use the package and it would be against what I would expect but it would make sense for your data. – Dason Oct 22 '20 at 16:10
  • Also your example isn't reproducible unless you include a call to load the packages you're using – Dason Oct 22 '20 at 16:11

2 Answers

2

Are you running into this issue?

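library(survey)
data(api)

# Cluster sample design from the package's example data
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
vector_of_variables <- c( 'api00' , 'api99' )

# Call svymean() once per variable instead of once for all variables
result <- 
    lapply( 
        vector_of_variables , 
        function( w ) svymean( as.formula( paste( "~" , w ) ) , dclus1 , na.rm = TRUE ) 
    )

# Turn each result into a one-row data frame with the mean and its SE
result <- lapply( result , function( v ) data.frame( variable = names( v ) , mean = coef( v ) , se = as.numeric( SE( v ) ) ) )

# Stack the rows into a single table
do.call( rbind , result )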
Anthony Damico
  • Thanks, this worked! I am now getting nonzero means... but my SEs are still 0... any idea why that would be the case? – NewBee Oct 22 '20 at 16:26
  • Is it possible to specify rounding in your code, so that if a value is 0.001 it will show as such without being rounded? – NewBee Oct 22 '20 at 23:46
0

When you compute a set of means with svymean, only observations with non-missing values for all of those variables are used. That's because svymean estimates the covariance matrix for the means, so it can't use partially missing data. In your example there are no observations with values for all the variables.
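
A quick way to check whether this applies to your data is to count the complete rows. Here is a minimal sketch in base R, using a made-up toy data frame (hypothetical values, not the real data) that mimics the pattern in the question, where Q62_1 is answered only when Q62_11 is missing:

```r
# Toy data (hypothetical values) mimicking the missingness pattern
# in the question: no row has both variables observed.
toy <- data.frame(
  Q62_1  = c(0, NA, 1, NA, NA),
  Q62_11 = c(NA, 1, NA, 1, NA)
)

# Number of non-missing values available for each variable on its own
n_per_variable <- colSums(!is.na(toy))

# Number of rows with no missing value in any variable
n_complete <- sum(complete.cases(toy))

print(n_per_variable)
print(n_complete)
```

Each variable has observed values on its own, but the count of complete rows is zero, so a joint call over both variables would have nothing left to work with.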

You can do something like this to loop over the variables one at a time:

lapply(names(data_in)[1:8], 
   function(v) eval(bquote(svymean(~.(as.name(v)),SurveyDesign,na.rm=TRUE)))
)

and get answers like

> lapply(names(data_in)[1:8], 
+    function(v) eval(bquote(svymean(~.(as.name(v)),SurveyDesign,na.rm=TRUE)))
+ )
[[1]]
         mean     SE
Q62_1 0.38902 0.0399

[[2]]
         mean    SE
Q62_2 0.29171 0.057

[[3]]
          mean     SE
Q62_3 0.042812 0.0337

[[4]]
         mean     SE
Q62_4 0.49944 0.0345

[[5]]
         mean     SE
Q62_5 0.33809 0.0554

[[6]]
          mean     SE
Q62_6 0.033547 0.0337

[[7]]
         mean     SE
Q62_8 0.73399 0.0465

[[8]]
         mean     SE
Q62_9 0.62947 0.0471
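
The loop above turns each column name into a one-sided formula before it reaches svymean. As a rough illustration in base R only (the name stored in v is just a placeholder), these three constructions should all yield a formula on the same single variable:

```r
v <- "Q62_1"  # placeholder column name

# 1. Substitute the name into a quoted formula, then evaluate it
f_bquote <- eval(bquote(~ .(as.name(v))))

# 2. Paste the formula together as text and convert it
f_paste <- as.formula(paste("~", v))

# 3. Let reformulate() build it from the character vector
f_reform <- reformulate(v)

# Each formula refers to the same single variable
print(all.vars(f_bquote))
print(all.vars(f_paste))
print(all.vars(f_reform))
```

The loop above applies bquote to the whole svymean call rather than to the formula alone, but the substitution step is the same.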
Thomas Lumley
  • I don't get that, unless you mean for `Q62_11` where it really is zero: the only observed value is `1`. – Thomas Lumley Oct 27 '20 at 00:43
  • You are right, sorry, it does work on my sample data, just not on my real data. I replaced all the NAs with 1 in my full dataset just to check whether it would return nonzero SEs... but they are still zero. What else could be causing this? :( – NewBee Oct 27 '20 at 19:36