I am using Stata to process some data, export it to a csv file, and load it in Python using the pandas read_csv function.

The problem is that everything is so slow. Exporting from Stata to a csv file takes ages (exporting in the dta Stata format is much faster), and loading the data via read_csv is also very slow. Using the read_stata pandas function is even worse.

I wonder if there are any other options, like exporting to a format other than csv? My csv dataset is approximately 6-7 GB.

Any help appreciated

Thanks

ℕʘʘḆḽḘ
  • `read_stata()` is much faster starting with version 0.15.0 of pandas, so make sure you are up to date. – JohnE Apr 30 '15 at 17:35

1 Answer

pd.read_stata() and DataFrame.to_stata() are pretty efficient; see here

Jeff
  • Thanks Jeff, but it appears that loading large Stata datasets in pandas is even slower than using csv... – ℕʘʘḆḽḘ Apr 30 '15 at 17:26
  • @Noobie make sure you are using pandas 0.15.0 or higher, which is much faster at reading DTA files than version 0.14 and earlier. That said, I have had some problems with larger Stata datasets, e.g. http://stackoverflow.com/questions/28748088/overflow-error-with-pandas-read-stata – JohnE Apr 30 '15 at 17:30
  • You can use the ``chunksize=..`` option as of 0.16.0; it should be quite efficient. – Jeff Apr 30 '15 at 17:35
  • Good point, I did still have that same problem with version 0.16 but didn't try the chunksize option. – JohnE Apr 30 '15 at 17:43
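
For anyone landing here later, the chunksize suggestion in the comments could look roughly like the sketch below. It assumes a reasonably recent pandas, where `pd.read_stata(..., chunksize=...)` returns a reader that can be used as a context manager and iterated to get DataFrames. The file name `demo.dta`, the column names, and the helper `sum_column_in_chunks` are made up for illustration; the demo file is tiny so the snippet runs on its own, and nothing here is a benchmark.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def sum_column_in_chunks(path, column, chunksize):
    """Sum one column of a .dta file without loading the whole file at once.

    Hypothetical helper, not part of pandas.
    """
    total = 0.0
    n_rows = 0
    # With chunksize set, read_stata returns an iterator of DataFrames
    # instead of a single DataFrame.
    with pd.read_stata(path, chunksize=chunksize) as reader:
        for chunk in reader:
            total += chunk[column].sum()
            n_rows += len(chunk)
    return total, n_rows


# Build a small demo file (a stand-in for the real multi-GB export).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.dta")
demo = pd.DataFrame(
    {"x": np.arange(1000, dtype="float64"), "g": np.arange(1000) % 7}
)
demo.to_stata(demo_path, write_index=False)

total, n_rows = sum_column_in_chunks(demo_path, "x", chunksize=128)
print(n_rows, total)
```

Each chunk is an ordinary DataFrame, so you can aggregate per chunk as above or filter each chunk and concatenate only the rows you need. Whether this ends up faster than csv for a 6-7 GB file depends on your data and pandas version, so it is worth timing both.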