11

I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I am interested in using IPython and tried to import the .dta file using Pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file results in all the RAM being used (after ~30 minutes) and my computer stalling out. This doesn't 'feel' right, in that I am able to open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:

%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')

and I am using IPython in Enthought's Canopy program. I use '%time' because I am interested in benchmarking this against R's read.dta().

My questions are:

  1. Is there something I am doing wrong that is resulting in Pandas having issues?
  2. Is there a workaround to get the data into a Pandas dataframe?
Nick Cox
Jonathan

5 Answers

8

Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:

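import sys

import pandas as pd

def load_large_dta(fname):
    chunks = []

    # iterator=True returns a StataReader, so rows can be pulled in batches
    with pd.read_stata(fname, iterator=True) as reader:
        try:
            chunk = reader.get_chunk(100 * 1000)
            while len(chunk) > 0:
                chunks.append(chunk)
                print('.', end='')
                sys.stdout.flush()
                chunk = reader.get_chunk(100 * 1000)
        except (StopIteration, KeyboardInterrupt):
            # end of file or Ctrl-C: keep whatever was read so far
            pass

    # concatenate once at the end; DataFrame.append was removed in pandas 2.0
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))

    return df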

I loaded an 11 GB Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.

This notebook shows it in action.
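If only some variables are needed, the same chunked reading can be combined with the `columns` and `convert_categoricals` arguments of `read_stata` to cut memory use further. The following is a minimal sketch under stated assumptions: the file `demo.dta` and the column names `id`, `income` and `unused` are made up for illustration, and the tiny file is created on the fly only so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in .dta file so the sketch is runnable (hypothetical data).
tmp_dir = tempfile.mkdtemp()
dta_path = os.path.join(tmp_dir, "demo.dta")
pd.DataFrame(
    {
        "id": range(6),
        "income": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
        "unused": list("abcdef"),
    }
).to_stata(dta_path, write_index=False)

# Read only two of the three columns, four rows at a time, and skip the
# conversion of labelled values to pandas categoricals.
with pd.read_stata(
    dta_path,
    columns=["id", "income"],
    convert_categoricals=False,
    chunksize=4,
) as reader:
    subset = pd.concat([chunk for chunk in reader], ignore_index=True)

print(subset.shape)
```

How much this helps depends on how many columns can be dropped; for a real file, replace the demo path and column names with your own.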

Abraham D Flaxman
  • I tested this function and the method using `read_stata` with chunksize (as suggested by Jinhua Wang) against `read_stata` without chunksize, on a dataset with 1.8 million rows. For me, reading without chunksize took 5 minutes. I ran the two optimisations twice, and the function method was faster both times (by 10 seconds the first time, at ~3 minutes, and by 60 seconds the second time, at ~2 minutes) – FullMetalScientist Jan 17 '22 at 23:21
4

There is a simpler way to solve it using Pandas' built-in function read_stata.

Assume your large file is named large.dta.

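import pandas as pd

chunks = []

# chunksize makes read_stata return an iterator of DataFrames
with pd.read_stata("large.dta", chunksize=100000) as reader:
    for itm in reader:
        chunks.append(itm)

# concatenate once; DataFrame.append was removed in pandas 2.0
df = pd.concat(chunks)

df.to_csv("large.csv")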
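If the goal is only the CSV file, the chunks do not have to be collected into one DataFrame first; each chunk can be appended to the CSV as it is read, so only one chunk is in memory at a time. Here is a rough sketch of that variation. The helper name `stata_to_csv` and the demo file names are assumptions for illustration, not part of pandas, and the small .dta file is generated only to make the example self-contained.

```python
import os
import tempfile

import pandas as pd


def stata_to_csv(dta_path, csv_path, chunksize=100_000):
    """Convert a .dta file to CSV chunk by chunk; return the rows written."""
    rows = 0
    with pd.read_stata(dta_path, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            first = i == 0
            # Overwrite and write the header on the first chunk, append after.
            chunk.to_csv(
                csv_path,
                mode="w" if first else "a",
                header=first,
                index=False,
            )
            rows += len(chunk)
    return rows


# Demo on a small synthetic file (hypothetical data).
tmp_dir = tempfile.mkdtemp()
dta_path = os.path.join(tmp_dir, "demo.dta")
csv_path = os.path.join(tmp_dir, "demo.csv")
pd.DataFrame({"x": range(10), "y": [float(v) for v in range(10)]}).to_stata(
    dta_path, write_index=False
)

n_rows = stata_to_csv(dta_path, csv_path, chunksize=3)
print(n_rows)
```

The trade-off is that nothing is left in memory at the end, so this suits a one-off conversion rather than interactive work on the full data.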
Jinhua Wang
3

For all the people who end up on this page, please upgrade Pandas to the latest version. I had this exact problem with a stalled computer during load (300 MB Stata file but only 8 GB of system RAM), and upgrading from v0.14 to v0.16.2 solved the issue in a snap.

Currently, it's v0.16.2. There have been significant improvements to speed, though I don't know the specifics. See: most efficient I/O setup between Stata and Python (Pandas)

Community
AZhao
0

Question 1.

There's not much I can say about this.

Question 2.

Consider exporting your .dta file to .csv using the Stata command outsheet or export delimited, and then using read_csv() in pandas. In fact, you could take the newly created .csv file, use it as input for R and compare with pandas (if that's of interest). read_csv is likely to have had more testing than read_stata.

Run help outsheet for details of the exporting.
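On the pandas side, reading the exported file could look roughly like the sketch below. It assumes a CSV with hypothetical columns `id`, `income` and `region`; an in-memory string stands in for the exported file so the example runs by itself. `usecols`, `dtype` and `chunksize` are regular `read_csv` arguments that help keep memory use down.

```python
import io

import pandas as pd

# Stand-in for the CSV exported from Stata (hypothetical columns and values).
csv_text = "id,income,region\n1,100.5,north\n2,200.0,south\n3,150.25,north\n"

# Read only the needed columns, with compact dtypes, two rows at a time.
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "income"],
    dtype={"id": "int32", "income": "float32"},
    chunksize=2,
)
df = pd.concat([chunk for chunk in reader], ignore_index=True)

print(df.dtypes)
```

For a real export, pass the file path instead of the `io.StringIO` object and pick a much larger `chunksize`.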

Roberto Ferrer
-2

You should not be reading a 3GB+ file into an in-memory data object; that's a recipe for disaster (and has nothing to do with pandas). The right way to do this is to mem-map the file and access the data as needed.

You should consider converting your file to a more appropriate format (csv or hdf) and then you can use the Dask wrapper around pandas DataFrame for chunk-loading the data as needed:

from dask import dataframe as dd
# If you don't want to use all the columns, make a selection
columns = ['column1', 'column2']
# usecols is forwarded to pandas.read_csv
data = dd.read_csv('your_file.csv', usecols=columns)

This will transparently take care of chunk-loading, multicore data handling and all that stuff.

javier