10

Sorry if this is really obvious, but I can't see how to do a simple Pearson correlation between two variables in the survey package. My data has strata so it would be the equivalent to finding r for api00 and api99 in apistrat.

library(survey)
data(api)

dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

I'm sure there must be a simple way of doing it using svyvar or svyglm or something but I can't see it?

CWD
  • 113
  • 1
  • 7
  • cor(apistrat$api00, apistrat$api99) [1] 0.9741931 The default is Pearson correlation. – Ven Yao Dec 22 '15 at 17:16
  • Thanks but I don't think that would take account of the strata or the weights. That's why I was hoping for something in the survey package. – CWD Dec 22 '15 at 18:10
  • wtd.cor in the weights package might do the job, but I'm not sure. It also wouldn't help with more complex sampling designs. Something in the survey package would certainly be the most convenient option. – CWD Dec 22 '15 at 18:14

3 Answers3

11

You can use svyvar to estimate the variance-covariance matrix, and then scale it to the correlation:

library(survey)
data(api)

dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
v <- svyvar(~api00+api99, dstrat)

as.matrix(v)
cov2cor(as.matrix(v))

This works for any number of correlations and any design.

Thomas Lumley
  • 251
  • 2
  • 4
2
library(survey)
data(api)
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
summary(svyglm(api00~ell+meals+mobility, design=dstrat),correlation=T)
Anthony Damico
  • 5,779
  • 7
  • 46
  • 77
  • Hi, as far as I can tell svychisq is to do a measure of association for cross tabulated, categorical data? I'm after the correlation coefficient for two continuous variables. – CWD Dec 26 '15 at 18:21
  • Thanks for the edit Anthony. Could you elaborate any more on this please? For example how would you specify the model to get the the correlation between api00 and api99? This ought to be close to one whether or not the weighting and strata are used, but I can't get anything that looks right using this method. – CWD Dec 30 '15 at 10:23
  • `summary(svyglm(api99~api00, design=dstrat),correlation=T)` – Anthony Damico Dec 30 '15 at 15:11
  • I'm afraid I still can't see the Pearson's correlation between those two variables. From the regression coefficient for api00 and from the un-weighted correlation, I would expect it to be close to 1. – CWD Dec 30 '15 at 16:34
  • it says Correlation of Coefficients: (Intercept) api00 -0.98 – Anthony Damico Dec 30 '15 at 17:46
  • A coefficient of -0.98 would suggest an almost perfect negative correlation, when these two variables are almost perfectly positively correlated. I think those correlations in the matrix must be referring to something else in the modelling process? – CWD Dec 30 '15 at 18:28
  • i believe they're just negated but you can email the author of the `survey` package to confirm – Anthony Damico Dec 30 '15 at 21:05
1

I've been thinking around the problem a bit and I'm starting to think the best way forward might be to just scale both of the variables first, presumably using svymean and svyvar.

dstrat2 <- transform(dstrat, 
                     z_api99 = (api99 - svymean(~api99,    dstrat))/sqrt(svyvar(~api99, dstrat)), 
                     z_api00 = (api00 - svymean(~api00, dstrat))/sqrt(svyvar(~api00, dstrat))) 

svyglm(z_api99 ~ z_api00, dstrat2)$coefficients

This gives 9.759047e-01, which is the same result as using:

library(weights)
wtd.cor(apistrat$api99, apistrat$api00, weight = apistrat$pw)

It has the advantage that it could be used with pretty much any survey design type. It also provides a way of getting the standardised beta coefficients if there are more variables. It isn't quite what I was after with the original question, but it might be the best way if there isn't a specific option.

If anyone else can confirm if this works or not, or if there is a better way, then I'd be very grateful for any further comments.

CWD
  • 113
  • 1
  • 7