
I have a dataset in Stata that is about 5.8 million rows (records).

I've been learning pandas the past few months and really enjoy its capabilities. Would pandas still work in this scenario?

I am having trouble reading the dataset into a DataFrame. I'm currently looking at chunking:

chunks = pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app'])
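Something like this is what I have in mind for the chunked route (a rough sketch that repeats the read from above so it stands alone, and assumes a plain sum over `app` is all I need):

import pandas as pd

# read only the three columns I care about, 100k rows at a time
chunks = pd.read_stata('data.dta', chunksize=100000,
                       columns=['year', 'race', 'app'])

# sum within each chunk, then sum the partial results,
# so the full dataset never has to sit in memory at once
partials = [chunk.groupby(['year', 'race']).sum() for chunk in chunks]
result = pd.concat(partials).groupby(level=['year', 'race']).sum()
result.to_csv('data.csv')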

Is there a better way to go about this? I am hoping to do something like:

import pandas as pd

df = pd.read_stata('data.dta')             # read the full file into memory
data = df.groupby(['year', 'race']).sum()  # sum the numeric columns per year/race
data.to_csv('data.csv')

but that does not work because (I think) the dataset is too large. The error is: OverflowError: Python int too large to convert to C long

Thanks. Cheers

As3adTintin
  • Are you sure the issue is not with the sum? 6M is far from INT_MAX, even on a 32-bit platform. – Kirell Jun 16 '15 at 20:47
  • @Kikohs Thanks for the insight. Yes, I'm having trouble just at the `pd.read_stata('data.dta')` stage - haven't even got to the `groupby` function – As3adTintin Jun 16 '15 at 20:48
  • 1
  • Could you read by chunks and export to CSV? Then read the CSV again? I have never worked with Stata files, but pandas easily handles millions of rows for CSV. – Kirell Jun 16 '15 at 20:51
  • @Kikohs Hmm, I tried exporting it as a CSV from Stata and then reading that using `pd.read_csv`... however, I got a `MemoryError`. Would 113 variables be too much? – As3adTintin Jun 16 '15 at 21:34
  • What is your hardware spec? 113 columns is quite big. It depends on the data type also. Pandas is not really the best tool if you cannot load everything in RAM. – Kirell Jun 16 '15 at 22:00
  • I suggest having a look at GraphLab Create or PySpark. GraphLab has SFrames (almost like pandas DataFrames). It is memory- and disk-based and can handle very big datasets. (I am using it on a dataset of 2 billion rows right now.) – Kirell Jun 16 '15 at 22:03
  • @Kikohs I'm on 16 GB of RAM... so that shouldn't be a problem, right? I'm on a 64-bit i5-3570 machine with 16 GB of RAM – As3adTintin Jun 17 '15 at 13:49
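For completeness, a minimal sketch of the CSV route suggested in the comments above. It assumes the Stata export is named data.csv, that only the three columns are needed, and that the dtypes given are reasonable guesses; reading with usecols in chunks means the 113-variable file is never loaded whole, which may get around the `MemoryError` (the aggregated output goes to a separate, arbitrarily named file):

import pandas as pd

# read only the needed columns of the wide CSV export, in chunks;
# the explicit dtypes are assumptions, there only to shrink memory per chunk
reader = pd.read_csv('data.csv', usecols=['year', 'race', 'app'],
                     dtype={'year': 'int32', 'app': 'float64'},
                     chunksize=500000)

# aggregate each chunk, then aggregate the partial sums once more
partials = [chunk.groupby(['year', 'race']).sum() for chunk in reader]
totals = pd.concat(partials).groupby(level=['year', 'race']).sum()
totals.to_csv('aggregated.csv')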

0 Answers