
I have a 10 GB Stata .dta file and I am trying to read it into 64-bit R 3.3.1. I am working on a virtual machine with about 130 GB of RAM (4 TB HD), and the .dta file has about 3 million rows and somewhere between 400 and 800 variables.

I know data.table::fread() is the fastest way to read in .txt and .csv files, but does anyone have a recommendation for reading largeish .dta files into R? Opening the .dta file in Stata takes about 20-30 seconds, although I need to raise Stata's memory limit before opening the file (I set the max at 100 GB).

I have not tried exporting to .csv from Stata, since I hope to avoid touching the file with Stata at all. One solution is described in Using memisc to import stata .dta file into R, but it assumes RAM is scarce. In my case, I should have sufficient RAM to work with the file.

    If you are comfortable with python, you could convert your dta file to a csv file. The SO link [Convert Stata .dta file to CSV without Stata software](http://stackoverflow.com/questions/2536047/convert-stata-dta-file-to-csv-without-stata-software) describes this in one of the answers (not the top answer). – steveb Aug 08 '16 at 02:51
  • If you have enough RAM, `foreign::read.dta()` should work, but it doesn't work on the latest Stata format. – shayaa Aug 08 '16 at 03:06
  • Perhaps I should have articulated better: the goal is to use R and do it QUICKLY. `read.dta` is incredibly slow and I'm hoping to avoid converting the file to .csv. – QuestionAnswer Aug 08 '16 at 04:51
  • It's still conceivable that `dta` -> `csv` -> `data.table` would be your fastest option (although I hope not; see the sketch after these comments). If I were you I'd look through the results of `library(sos); findFn("stata dta")` and benchmark on a reasonable (1GB?) size subset. – Ben Bolker Aug 08 '16 at 14:58
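For what it's worth, here is a minimal sketch of that dta -> csv -> fread route in R. The file names are placeholders, and it assumes the haven and data.table packages are installed:

library(haven)       # read_dta(), zap_labels()
library(data.table)  # fwrite(), fread()

# One-time conversion: read the .dta, drop Stata value labels, write a CSV.
dta <- zap_labels(read_dta("my_large_file.dta"))
fwrite(dta, "my_large_file.csv")

# Subsequent reads go through fread(), which is fast and multithreaded.
dt <- fread("my_large_file.csv")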

3 Answers


The fastest way to load a large Stata dataset in R is with the readstata13 package. I have compared the performance of the foreign, readstata13, and haven packages on a large dataset in this post, and the results repeatedly showed that readstata13 was the fastest available package for reading Stata datasets into R.
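For reference, a minimal usage sketch (the file name is a placeholder):

library(readstata13)

# read.dta13() reads modern Stata formats; convert.factors controls
# whether labelled Stata values become R factors.
dat <- read.dta13("my_large_file.dta", convert.factors = TRUE)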

---

Since this post is at the top of the search results, I re-ran the benchmarking on the current versions of haven and readstata13. At this point the two packages are comparable, with haven slightly faster. In terms of time complexity, both scale approximately linearly with the number of rows.

[Plot: run times of both packages by number of rows, with best-fit lines]

Here is the code to run the benchmark:

library(haven)        # read_dta()
library(readstata13)  # read.dta13()
library(dplyr)        # tibble(), %>%, bind_rows()

sizes <- round(10^(seq(2, 7, 0.5)))  # row counts from 1e2 to 1e7

benchmark_read <- function(n_rows) {
  start_t_haven <- Sys.time()
  maisanta_dataset <- read_dta("my_large_file.dta", n_max = n_rows)
  end_t_haven <- Sys.time()

  start_t_readstata13 <- Sys.time()
  maisanta_dataset <- read.dta13("my_large_file.dta", select.rows = n_rows)
  end_t_readstata13 <- Sys.time()

  tibble(
    size = n_rows,
    # record elapsed times in seconds so the units are always comparable
    haven_time = as.numeric(end_t_haven - start_t_haven, units = "secs"),
    readstata13_time = as.numeric(end_t_readstata13 - start_t_readstata13, units = "secs")
  )
}

benchmark_results <- lapply(sizes, benchmark_read) %>%
  bind_rows()
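A plot like the one above could then be reproduced from benchmark_results along these lines. This is a sketch using ggplot2 and tidyr; the original figure's exact aesthetics are an assumption:

library(ggplot2)
library(tidyr)

benchmark_results %>%
  pivot_longer(-size, names_to = "package", values_to = "seconds") %>%
  ggplot(aes(x = size, y = seconds, colour = package)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # best-fit line per package
  scale_x_log10() +                         # sizes span several orders of magnitude
  scale_y_log10()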
---

I recommend the haven R package. Unlike foreign, it can read the latest Stata formats:

library(haven)
data <- read_dta('myfile.dta')

Not sure how fast it is compared to other options, but your choices for reading Stata files in R are rather limited. My understanding is that haven wraps a C library (ReadStat), so it's probably your fastest option.
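If you only need a subset of those 400-800 variables, haven's col_select argument can cut both read time and memory use. A sketch, where the column names are hypothetical stand-ins for the real ones:

library(haven)

# Read just the columns you need; 'id' and the income_* variables
# are hypothetical names standing in for the real ones.
data <- read_dta("myfile.dta", col_select = c(id, starts_with("income")))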

– ChrisP