
I'm working on an app that processes a lot of data.

.... and keeps running my computer out of memory. :(

Python objects have a huge amount of memory overhead (as per sys.getsizeof()). A basic tuple with one integer in it takes up 56 bytes, for example. An empty list, 64 bytes. Serious overhead.
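For example, on my machine (exact figures vary with Python version and platform):

```python
import sys

# Per-object overhead as reported by sys.getsizeof(); the numbers below
# are from a 64-bit CPython build and will differ slightly between versions.
print(sys.getsizeof((1,)))   # e.g. 56 bytes for a tuple holding one int
print(sys.getsizeof([]))     # e.g. 64 bytes for an empty list
print(sys.getsizeof(1.0))    # e.g. 24 bytes for a single float
```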

NumPy arrays are great for reducing overhead, but they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). The array module (https://docs.python.org/3/library/array.html) seems promising, but it's 1-D. My data is 2-D, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and of 2 ints (ideally uint32) for the other. Obviously, using ~80 bytes of Python structure to store 12 or 8 bytes of data per row is going to total my memory consumption.

Is the only realistic way to keep memory usage down in Python to "fake" 2-D, i.e. by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)/WIDTH?
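For example, a minimal sketch of what I mean, using the array module (typecode 'f' is a C float, which is 32-bit on typical platforms):

```python
from array import array

WIDTH = 3                       # three floats per logical row

points = array('f')             # 'f' = C float; 'I' would give a C unsigned int

def append_row(a, row):
    a.extend(row)               # grow the flat array by one logical row

def get(a, row, col):
    return a[row * WIDTH + col]

def num_rows(a):
    return len(a) // WIDTH

append_row(points, (1.0, 2.0, 3.0))
print(num_rows(points), get(points, 0, 2))   # -> 1 3.0
```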

KarenRei
  • There are a lot of ways to create arrays in numpy. Where is your data coming from? Are you computing it, reading it from a socket, pulling it from a CSV file, or ... ? – aghast Jul 22 '17 at 03:06
  • ints, floats, bytes, strings? – wwii Jul 22 '17 at 04:28
  • @Austin: I'm parsing it with regexes out of json files. Some points and lines get thrown away in processing, but the files are massive. wwii: I mentioned the datatypes in the question - ideally float32 and uint32. – KarenRei Jul 22 '17 at 11:14

1 Answer


Based on your comments, I'd suggest that you split your task into two parts:

1) Parse the JSON files using your regexes and generate two CSV files in a simple format: no headers, no spaces, just numbers. This should be quick and performant, with no memory issues: read text in, write text out. Don't try to keep anything in memory that you don't absolutely have to. (A streaming sketch follows after this list.)

2) Use pandas' read_csv() function to slurp in the CSV files directly, with explicit dtypes so the columns land in compact arrays. (Yes, pandas! You've probably already got it, and it's hella fast.) A read_csv sketch follows below as well.
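A rough streaming sketch of part 1 (the regex and file names here are placeholders; substitute your real patterns and paths):

```python
import csv
import re

# Placeholder pattern and paths -- swap in your actual regexes and JSON files.
POINT_RE = re.compile(r'(-?\d+\.?\d*),\s*(-?\d+\.?\d*),\s*(-?\d+\.?\d*)')

with open('input.json') as src, open('points.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for line in src:                          # stream one line at a time
        for m in POINT_RE.finditer(line):
            writer.writerow(m.groups())       # write immediately, keep nothing around
```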
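And a sketch of part 2, pulling those CSVs straight into compact float32/uint32 columns (the column names are illustrative; adjust to your layout):

```python
import numpy as np
import pandas as pd

# dtype forces 4 bytes per value instead of full Python objects.
points = pd.read_csv('points.csv', header=None, names=['x', 'y', 'z'], dtype=np.float32)
lines = pd.read_csv('lines.csv', header=None, names=['a', 'b'], dtype=np.uint32)

print(points.dtypes)                              # float32 columns
print(points.memory_usage(index=False).sum())     # ~12 bytes per row of data
```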

aghast
  • Actually, that's a great idea; it'll save a lot of time when processing needs to be re-run (which during initial debugging is often!). Thanks! (so long as the data structure returned by pandas is memory efficient :) ) – KarenRei Jul 23 '17 at 12:59
  • For the record, I'm skipping pandas - after messing around with it for a while I came to the conclusion that numpy's genfromtxt is much better. – KarenRei Jul 24 '17 at 18:22