
I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs also occupies approximately the amount of space you would expect (~800 MB) in memory when I'm working with that data.

I've been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta on the Stata file, fread on a CSV file, read.spss on the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R file, that file is less than 100 MB, but on loading the object is the same size as before.

Any attempt to analyse the data using the survey package, particularly if I try to subset the data, takes significantly longer than the equivalent command in Stata.

E.g. I have a household size variable 'hhpers' in my data 'hh', weighted by the variable 'hhwt' and subset by 'htype'.

R code:

require(survey)
sv.design <- svydesign(ids = ~0, data = hh, weights = hh$hhwt)
rm(hh)
system.time(svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ]))

pushes the memory used by R up to 6 GB and takes a very long time: user 3.70, system 1.75, elapsed 144.11 seconds.

The equivalent operation in Stata

svy: mean hhpers if htype == 1

completes almost instantaneously, giving me the same result.

Why is there such a massive difference in both memory usage (by the object as well as by the function) and time taken between R and Stata? Is there anything I can do to optimise the data and the way R works with it?

ETA: My machine is running 64 bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.

After some digging, I suspect the reason for this is R's limited set of data types. All my data is stored as int, which takes 4 bytes per element. In survey data each response is categorically coded and typically requires only one byte to store; Stata stores this using its 'byte' data type, while R stores it using its 'int' data type, leading to significant inefficiency in large surveys.
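
For illustration, a quick sketch with made-up data (not my actual survey) showing the 4-bytes-per-element cost; R does have a 1-byte raw type, but it isn't practical for analysis:

x <- sample(1:5, 1e6, replace = TRUE)   # categorical codes stored as R integers
object.size(x)                          # ~4 MB: 4 bytes per element
object.size(as.raw(x))                  # ~1 MB: 1 byte per element, but raw vectors
                                        # are unusable for most statistical functions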

bldysabba
  • Did you try turning `set virtual` on in Stata? Can you compare the object sizes after doing that? – rmuc8 Apr 23 '15 at 11:36
  • What is the result of `str(hh)`? Could this be related to factors? – Roland Apr 23 '15 at 11:59
  • There are 917 variables in the dataset, so str(hh) doesn't really offer up anything immediately useful. Most variables seem to be int though. – bldysabba Apr 23 '15 at 14:24
  • `set virtual` does not have any effect in Stata – bldysabba Apr 23 '15 at 14:24
  • try using `library(haven)` and `read_spss()` instead of `library(foreign)` and `read.spss()` – Anthony Damico Apr 24 '15 at 16:41
  • The description of your setup is not sufficiently complete. Guessing Windows of some flavor. We can try 20 questions, of course. What other programs do you have loaded at the time you start R? Have you tried starting your OS and R with no other programs loaded? You need to clarify the size of the data in RAM versus the size of the file. Use appropriate functions and show full output. `?"Memory-limits"`. Anthony would be the best person to opine about the claim of "equivalence" of the code. – IRTFM May 04 '15 at 20:34
  • @BondedDust My machine is running 64-bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata. I have clarified the sizes of the file in RAM vs. on disk in my original question. It is a 730 MB file in Stata, and the Stata program occupies approximately 850 MB of RAM according to Task Manager. In R, the object size using the `object.size()` function is 2717423200 bytes, and the RAM occupied by the R session is 2251.9 MB. Are there any other details I can provide? – bldysabba May 05 '15 at 21:04
  • @BondedDust Also, please note that I'm not claiming that the code is equivalent across the two programs. I am only saying that the equivalent operation, which gets me the same result in Stata, takes far less time. I have explicitly asked how I can optimise the way R is working with the data, which to me certainly includes any changes to the code that would bring improvements. – bldysabba May 05 '15 at 21:13
  • @AnthonyDamico `library(haven)` and `read_spss()` actually result in an even larger object – bldysabba May 05 '15 at 21:13
  • Hm, looks like you need a package to get anything smaller than integer type. (Apparently, even booleans are not as small as they could be: http://stackoverflow.com/questions/9178254/why-do-logicals-booleans-in-r-require-4-bytes). – Frank May 05 '15 at 21:21
  • @Frank Yes, I also came across the 'ff' package that implements some functionality which would give access to smaller data types, but using these packages seems to significantly complicate analysis, and I'm not sure if you could even use it to perform analysis with other packages without translating data back and forth between standard R data types. – bldysabba May 05 '15 at 21:40
  • If the survey is publicly available (as many are), you should provide a link so that others can test it and you will have answers instantly. – user227710 May 17 '15 at 02:05
  • @bldysabba http://asdfree.com/ hosts r code to work with many public use survey data sets. the ones too big to fit in ram on 4GB machines are easily stored in database-backed survey objects. i'm not intending to answer your question here, only mentioning this because the data size loaded in ram doesn't really matter for survey analysis with r. specific examples: https://github.com/ajdamico/usgsd/search?utf8=%E2%9C%93&q=RSQLite+OR+MonetDB.R&type=Code – Anthony Damico May 30 '15 at 18:16

2 Answers


Regarding the difference in memory usage: you're on the right track, and it's (mostly) because of object types. Storing everything as int will indeed take up a lot of your memory, so setting variable types properly will improve R's memory usage. as.factor() would help; see ?as.factor for details on converting columns after reading the data. To fix this while reading the data from file, refer to the colClasses parameter of read.table() (and of the similar functions specific to the Stata and SPSS formats). This will help R store the data more efficiently (its on-the-fly guessing of types is not top-notch).
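
For instance, a minimal sketch of both approaches (the file name, separator, and column names below are hypothetical, not taken from the question):

hh <- read.table("hh.csv", header = TRUE, sep = ",",
                 colClasses = c("factor", "integer", "numeric"))  # e.g. htype, hhpers, hhwt
## or convert coded categoricals after reading:
hh$htype <- as.factor(hh$htype)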

Regarding the second part, calculation speed: parsing and summarising large datasets is not base R's strong point, and that's where the data.table package comes in handy. It is fast and behaves much like an ordinary data.frame, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)), and you can calculate something similar to your example with

library(data.table)
hh <- as.data.table(hh)
hh[htype == "rural", weighted.mean(hhpers, hhwt)]  # weighted mean for the rural subset
## or, grouped by household type:
hh[, weighted.mean(hhpers, hhwt), by = htype]      # note the 'empty' first argument

Sorry, I'm not familiar with survey data studies, so I can't be more specific.

Another detail on memory usage by the function: most likely R made a copy of your entire dataset to calculate the summaries you were looking for. Again, in this case data.table would help by preventing R from making excessive copies and improving memory usage.
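
A small sketch of that copy-avoidance point (the derived column name is hypothetical):

library(data.table)
setDT(hh)                          # convert the data.frame to data.table in place
hh[, wtd_pers := hhpers * hhwt]    # `:=` adds the column by reference, so the
                                   # multi-GB table is never duplicated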

Sergii Zaskaleta

Of interest may also be the memisc package, which for me resulted in much smaller eventual files than read.spss (I was, however, working at a smaller scale than you).

From the memisc vignette:

... Thus this package provides facilities to load such subsets of variables, without the need to load a complete data set. Further, the loading of data from SPSS files is organized in such a way that all informations about variable labels, value labels, and user-defined missing values are retained. This is made possible by the definition of importer objects, for which a subset method exists. importer objects contain only the information about the variables in the external data set but not the data. The data itself is loaded into memory when the functions subset or as.data.set are used.
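
A rough sketch of what that workflow might look like (the file name and selected variables are hypothetical; see ?spss.system.file and ?importer in memisc):

library(memisc)
imp <- spss.system.file("hh.sav")                          # importer: reads metadata only
hh.small <- subset(imp, select = c(hhpers, hhwt, htype))   # loads just these variables
hh.df <- as.data.frame(hh.small)                           # data.set -> data.frame if needed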

jmk