
I am trying to import a large dataset from Stata 13 into pandas using StataReader. This worked fine with pandas 0.13.1, but after upgrading to 0.14.1, reading .dta files has become drastically slower. Does anybody know what has changed (I could not find any changes to StataReader in the "What's New" section of the pandas website), and/or how to work around this?

Steps to reproduce my issue:

  1. Create a large dataset in Stata 13:

    clear
    
    set obs 11500
    forvalues i = 1/8000 {
        gen var`i' = 1
    }
    
    saveold bigdataset, replace
    
  2. Try to read it into pandas using StataReader:

    from pandas.io.stata import StataReader
    
    reader = StataReader('bigdataset.dta')
    data = reader.data()
    

Using pandas 0.13.1, this takes around 220 seconds, which is acceptable, but using pandas 0.14.1, it had still not finished after waiting around 20 minutes.
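For reference, a minimal sketch of how the reads can be timed for comparison (only the standard library's `time` module is needed; the filename matches the step above):

    import time
    from pandas.io.stata import StataReader
    
    # time how long StataReader takes to load the full dataset
    start = time.time()
    reader = StataReader('bigdataset.dta')
    data = reader.data()
    print('read took %.1f seconds' % (time.time() - start))
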

When I test this issue with a smaller dataset:

  1. Create a smaller dataset in Stata 13:

    clear
    
    set obs 11500
    forvalues i = 1/1000 {
        gen var`i' = 1
    }
    
    saveold smalldataset, replace
    
  2. Try to read it into pandas using StataReader:

    from pandas.io.stata import StataReader
    
    reader = StataReader('smalldataset.dta')
    data = reader.data()
    

Using pandas 0.13.1, this takes around 20 seconds, but using pandas 0.14.1, this takes around 300 seconds.

I would really like to upgrade to the new pandas version and work with my data, which is around the size of bigdataset.dta. Does anybody know a way I could efficiently import my data?

David
  • I don't know what's going on with pandas. I filed a bug weeks ago to which I'll link later. You seem to have Stata, so export to some other format that pandas will be happy to read, from within Stata. See the help files to easily do this. – Roberto Ferrer Aug 14 '14 at 23:10
  • I can't comment on pandas 13 vs 14, but I have found that pandas 14 reads in a CSV from stata way faster than a DTA, so I've been using stata outfile rather than save. It seems strange, given that binary ought to be more efficient than ascii, but I think they've put a lot of effort into reading CSVs and probably a lot less into reading DTAs (which is pretty understandable, all in all). – JohnE Aug 15 '14 at 01:42
  • These are two links related to your question: http://stackoverflow.com/a/24062015/2077064 and http://stackoverflow.com/a/19750420/2077064. See `help export` in Stata to review your options. You should [report this problem](https://github.com/pydata/pandas/issues) to pandas developers. There are in fact changes to Stata IO facilities in pandas 0.14.1. Just search for the string "stata" [here](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html). – Roberto Ferrer Aug 15 '14 at 02:42
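A minimal sketch of the CSV workaround suggested in the comments above, assuming the dataset has already been exported from within Stata (e.g. via `outsheet` or `export delimited`; see `help export` for the options in your Stata version). The `.csv` filename here is just illustrative:

    import pandas as pd
    
    # Read the CSV exported from Stata instead of the .dta file.
    data = pd.read_csv('bigdataset.csv')
    print(data.shape)  # should be (11500, 8000) for the large example
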

1 Answer


For anyone who has stumbled upon this and is interested in the answer - I posted this issue on the pandas GitHub page as per Roberto's suggestion, and they have found and fixed the performance issue. It works great using their master branch right now!

David
  • Thanks David. Can you please give a link to the GitHub page where the problem is reported? – Roberto Ferrer Aug 19 '14 at 21:24
  • I posted it here: https://github.com/pydata/pandas/issues/8040 and the discussion continued here: https://github.com/pydata/pandas/pull/8045 – David Aug 20 '14 at 21:48