I am using the following code (taken from the question "Pandas read_stata() with large .dta files") to load a very large Stata dataset (20 GB) in Python. My machine has 128 GB of RAM.
import pandas as pd

def load_large_dta(fname):
    import sys
    reader = pd.read_stata(fname, iterator=True)
    df = pd.DataFrame()
    try:
        chunk = reader.get_chunk(100*1000)
        while len(chunk) > 0:
            df = df.append(chunk, ignore_index=True)
            chunk = reader.get_chunk(100*1000)
            print '.',
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass
    print '\nloaded {} rows'.format(len(df))
    return df
The problem is that I get the following error:
OverflowError: Python int too large to convert to C long
Do you know how to fix that?
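For comparison, here is a minimal sketch of the same chunked loading written with the `chunksize` argument of `pd.read_stata` and a single `pd.concat` at the end, which avoids copying the growing frame on every append. It assumes a recent pandas on Python 3. The function name `load_large_dta_concat` and the default chunk size are only illustrative, and I have not verified whether this variant avoids the OverflowError.

```python
import pandas as pd


def load_large_dta_concat(fname, chunksize=100 * 1000):
    """Read a Stata file in chunks and combine them with one concat.

    Hypothetical helper for illustration; not taken from the linked question.
    """
    chunks = []
    # With chunksize set, read_stata returns an iterable reader
    # rather than a fully loaded DataFrame.
    with pd.read_stata(fname, chunksize=chunksize) as reader:
        for chunk in reader:
            chunks.append(chunk)
    if not chunks:
        return pd.DataFrame()
    # One concat at the end instead of repeated appends.
    return pd.concat(chunks, ignore_index=True)
```

It would be called the same way as the original function, for example `df = load_large_dta_concat(fname)`.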