I am using the following code (from the question "Pandas read_stata() with large .dta files") to load a very large Stata dataset (20 GB) in Python. My machine has 128 GB of RAM.

import sys

import pandas as pd


def load_large_dta(fname):
    reader = pd.read_stata(fname, iterator=True)
    df = pd.DataFrame()

    try:
        chunk = reader.get_chunk(100*1000)
        while len(chunk) > 0:
            df = df.append(chunk, ignore_index=True)
            chunk = reader.get_chunk(100*1000)
            print '.',
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass

    print '\nloaded {} rows'.format(len(df))

    return df

The problem is that I get the following error:

OverflowError: Python int too large to convert to C long

Do you know how to fix that?

ℕʘʘḆḽḘ
  • @AbrahamDFlaxman, creator of this nice iterator, I summon you! ;-) – ℕʘʘḆḽḘ Jan 15 '16 at 16:21
  • Just a heads up, the '@user' notification only works if the user has participated in this post. See this [meta SE](http://meta.stackexchange.com/questions/43019/how-do-comment-replies-work) question for more info – wnnmaw Jan 15 '16 at 16:23
  • oh ok thanks! does that mean there is no way to attract someone's attention in a comment? – ℕʘʘḆḽḘ Jan 15 '16 at 16:25
  • Only if they've been here before – wnnmaw Jan 15 '16 at 16:27
  • Aside: that code is very inefficient. `append` has to make a new copy of the entire frame, and so very soon you'll spend more time doing the append over and over again than the reading. If you really want a chunked read, store the chunks and then concatenate them. – DSM Jan 15 '16 at 16:45
  • @DSM thank you for your input. I am learning Pandas as we speak. Can you please show me how you would read the chunks as you suggest? Thanks!!! – ℕʘʘḆḽḘ Jan 15 '16 at 16:49
  • Basically `chunks = []; while len(chunk) > 0: chunks.append(reader.get_chunk(...)); df = pd.concat(chunks)`. – filmor Jan 15 '16 at 17:01
  • @filmor please post a complete snippet so that I can accept your answer – ℕʘʘḆḽḘ Jan 16 '16 at 18:16
  • I don't know whether it fixes your error (post a more complete traceback), it's just an explanation of how to do chunking in pandas without using `append`. – filmor Jan 17 '16 at 10:11
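
For reference, the "store the chunks and then concatenate them" idea from the comments could look roughly like the sketch below. It is only an illustration under some assumptions: it uses Python 3, a reasonably recent pandas in which `pd.read_stata(..., chunksize=...)` returns an iterator that can be used as a context manager, and a `chunksize` parameter on the function that is not in the original code. It addresses the repeated-copy inefficiency only; whether it has any effect on the `OverflowError` is not known without a fuller traceback.

```python
import pandas as pd


def load_large_dta(fname, chunksize=100_000):
    """Read a Stata file in chunks and concatenate them once at the end."""
    chunks = []

    # Passing chunksize makes read_stata return an iterator of DataFrames
    # instead of loading the whole file in one call.
    with pd.read_stata(fname, chunksize=chunksize) as reader:
        for chunk in reader:
            chunks.append(chunk)

    # One concat at the end avoids re-copying the growing frame per chunk.
    # Note: pd.concat raises ValueError if no chunks were read.
    df = pd.concat(chunks, ignore_index=True)
    print("loaded {} rows".format(len(df)))
    return df
```

Calling `load_large_dta("data.dta")` (the file name is a placeholder) would return a single DataFrame. All chunks are still held in memory before the final concatenation, so peak memory use is roughly twice the size of the loaded data.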

0 Answers