I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs also occupies approximately the amount of space you would expect (~800 MB) in memory when I'm working with that data.
I've been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta from the Stata file, fread from a CSV file, read.spss from the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object in an R file, that is less than 100 MB, but on loading it is the same size as before.
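To see where the memory goes, it can help to look at the storage type and size of each column rather than only the total. The sketch below uses a small made-up data frame (hh_toy and its values are hypothetical stand-ins, not the real survey data); the same two sapply() calls could be pointed at the real object.

```r
# Toy data frame standing in for the real survey data (hypothetical values).
set.seed(1)
n <- 1e5
hh_toy <- data.frame(
  hhpers = sample(1:10, n, replace = TRUE),                        # integer column
  hhwt   = runif(n, 50, 500),                                      # double column
  htype  = factor(sample(c("rural", "urban"), n, replace = TRUE))  # factor: integer codes plus labels
)

# Storage type and in-memory size (in bytes) of each column.
col_info <- data.frame(
  type  = sapply(hh_toy, typeof),
  bytes = sapply(hh_toy, function(x) as.numeric(object.size(x)))
)
print(col_info)

# Total size of the whole object, reported in MB.
print(object.size(hh_toy), units = "MB")
```

Columns reported as "double" cost 8 bytes per value and those reported as "integer" cost 4, so a file dominated by small categorical codes can grow several-fold once loaded.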
Any attempts to analyse the data using the survey package, particularly if I try to subset the data, take significantly longer than the equivalent command in Stata.
E.g. I have a household size variable 'hhpers' in my data 'hh', weighted by variable 'hhwt', subset by 'htype'.
R code:
require(survey)
sv.design <- svydesign(ids = ~0,data = hh, weights = hh$hhwt)
rm(hh)
system.time(svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ]))
pushes the memory used by R up to 6 GB and takes a very long time:
   user  system elapsed
   3.70    1.75  144.11
The equivalent operation in Stata
svy: mean hhpers if htype == 1
completes almost instantaneously, giving me the same result.
Why is there such a massive difference, in both memory usage (by the object as well as the function) and time taken, between R and Stata?
Is there anything I can do to optimise the data and how R is working with it?
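One option that may reduce the overhead, offered as a sketch rather than a confirmed fix: build the design from only the columns the analysis needs, so that subsetting the design copies a much smaller data frame. The data below are made up (hh_toy, other1 and other2 are hypothetical), and the survey calls run only if the package is installed. A plain weighted mean on the same rows is included as a cross-check, since the point estimate from svymean() should match it.

```r
# Toy stand-in for 'hh' with a few extra columns the analysis does not need.
set.seed(42)
n <- 1000
hh_toy <- data.frame(
  hhpers = sample(1:10, n, replace = TRUE),
  hhwt   = runif(n, 50, 500),
  htype  = sample(c("rural", "urban"), n, replace = TRUE),
  other1 = rnorm(n),
  other2 = rnorm(n),
  stringsAsFactors = FALSE
)

# Keep only the columns needed for this estimate before building the design.
hh_small <- hh_toy[, c("hhpers", "hhwt", "htype")]

# Base-R cross-check: weighted mean of household size among rural households.
rural <- hh_small[hh_small$htype == "rural", ]
base_mean <- weighted.mean(rural$hhpers, rural$hhwt)
print(base_mean)

# Same estimate through the survey package, if it is available.
if (requireNamespace("survey", quietly = TRUE)) {
  des <- survey::svydesign(ids = ~1, weights = ~hhwt, data = hh_small)
  des_rural <- subset(des, htype == "rural")  # subset method for design objects
  print(survey::svymean(~hhpers, des_rural))
}
```

Whether this closes the gap with Stata on the full data is untested here; it only shrinks what each subsetting step has to copy.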
ETA: My machine is running 64-bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.
After some digging, I suspect the reason for this is R's limited number of data types. All my data is stored as int, which takes 4 bytes per element. In survey data, each response is categorically coded and typically requires only one byte to store. Stata stores such values using the 'byte' data type, while R stores them using the 'int' data type, leading to significant inefficiency in large surveys.
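The per-element cost can be checked directly. This small sketch (the vector names are illustrative) compares R's integer, double and raw storage for the same one million small codes. raw is R's one-byte type, but it has no NA and does not support arithmetic, so it is shown only for comparison, not as a practical replacement.

```r
n <- 1e6
x_int <- rep(3L, n)          # integer storage
x_dbl <- as.numeric(x_int)   # double storage
x_raw <- as.raw(x_int)       # raw storage: one byte per value (0-255)

# Approximate bytes per element for each storage type.
sizes <- c(
  integer = as.numeric(object.size(x_int)),
  double  = as.numeric(object.size(x_dbl)),
  raw     = as.numeric(object.size(x_raw))
)
print(round(sizes / n, 2))
```

If any columns were read in as double rather than int, they would cost 8 bytes per value, which would widen the gap relative to Stata's 'byte' type even further.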