5

Trying to load a file into Python. It's a very big file (1.5 GB), but I have the memory available and I only need to do this once (hence the use of Python; I just need to sort the file one time, so Python was an easy choice).

My issue is that loading this file results in way too much memory usage. When I've loaded about 10% of the lines into memory, Python is already using 700 MB, which is clearly too much. At around 50% the script hangs, using 3.03 GB of real memory (and slowly rising).

I know this isn't the most efficient method of sorting a file (memory-wise), but I just want it to work so I can move on to more important problems :D So, what is wrong with the following Python code that's causing the massive memory usage:

print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline() # Toss out the header
lines = []
totalLines = 31164015.0
currentLine = 0.0
printEvery100000 = 0
for line in input_file:
    currentLine += 1.0
    lined = line.split('\t')
    printEvery100000 += 1
    if printEvery100000 == 100000:
        print str(currentLine / totalLines)
        printEvery100000 = 0
    lines.append( (lined[timestamp_pos].strip(), lined[personID_pos].strip(), lined[x_pos].strip(), lined[y_pos].strip()) )
input_file.close()
print 'Done loading file into memory'

EDIT: In case anyone is unsure, the general consensus seems to be that each variable allocated eats up more and more memory. I "fixed" it in this case by 1) calling readlines(), which still loads all the data, but only has one string object's worth of overhead for each line. This loads the entire file using about 1.7 GB. Then, when I call lines.sort(), I pass a function to key that splits on tabs and returns the right column value, converted to an int. This is slow computationally, and memory-intensive overall, but it works. Learned a ton about variable allocation overhead today :D
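For reference, a rough sketch of that readlines-plus-key approach (reusing input_file_name and timestamp_pos from the snippet above, and assuming the sort column holds integer timestamps):

print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline()              # toss out the header
lines = input_file.readlines()     # one plain string per line -- far less overhead than a tuple of fields
input_file.close()
print 'Done loading file into memory'

def sort_key(line):
    # split only while sorting, and compare the timestamp column as an int
    return int(line.split('\t')[timestamp_pos])

print 'Sorting'
lines.sort(key=sort_key)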

Hamy
  • I imagine because lists take up more space in memory than the sum of their parts. – Rafe Kettler May 20 '11 at 04:16
  • Fair enough, but we are talking ~5x more memory than I expect is being consumed. I don't think they take up that much extra!! – Hamy May 20 '11 at 04:18
  • @Hamy yeah, it seems like a bit much to me too. – Rafe Kettler May 20 '11 at 04:20
  • me no gusta :/ My hunch is that somehow the lines being read in using the for loop are not being released from memory, but calling flush on input_file throws an error (e.g. I can't seem to force a read-only file to clean up its stdio memory) – Hamy May 20 '11 at 04:23
  • That memory usage is more than I expected. I would import the data into a database, run a query with the expected ORDER BY clause, and write the results to disk (a rough sqlite3 sketch of this idea follows these comments). Which version of Python are you using? – Steven Rumbalski May 20 '11 at 04:27
  • 2.5.4. Let's see if I have mysql still installed, that's not a bad option . . . – Hamy May 20 '11 at 04:29
  • The overhead is higher than you might think. I did a quick analysis of PHP's variable memory usage [over here](http://stackoverflow.com/questions/5972170/what-is-the-overhead-of-using-php-int/5972314#5972314), I'd expect Python (and Perl and Ruby and ...) to produce similar results. – mu is too short May 20 '11 at 04:39
  • Interesting points mu - I was able to load the entire file by just calling readlines(), so you may be on to something. MySQL seems to be working, so I'm going to get the work done there, but I will change my script to load it all in with readlines() (and use the function I pass to key to cut it up as needed so I can sort), and let you guys know if it works that way – Hamy May 20 '11 at 04:53
  • mu it looks like you hit on the answer first - if you post that I'll accept it – Hamy May 20 '11 at 05:26
  • Assume each string in Python has 30 bytes of overhead. If you multiply 30 bytes by the number of fields per record and the number of records (30M) and add in 1.5GB, what do you get? – Gabe May 20 '11 at 05:56
  • Have you tried `import gc` and then running `gc.collect()` e.g. every 100,000 lines? (this will run the garbage collector). Also, you might want to try out `for line in input_file.xreadlines()`. – Andre Holzner May 20 '11 at 05:56
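
To flesh out Steven Rumbalski's database suggestion, here is a rough, untested sqlite3 sketch (sqlite3 ships with Python 2.5+; the database/table/column names, file names, and placeholder column indexes below are made up for the example):

import sqlite3

# placeholder column indexes -- substitute the real positions from your file
timestamp_pos, personID_pos, x_pos, y_pos = 0, 1, 2, 3

conn = sqlite3.connect('sortscratch.db')
conn.execute('CREATE TABLE points (ts INTEGER, person TEXT, x TEXT, y TEXT)')

def rows(f):
    # stream (timestamp, person, x, y) tuples without holding the file in memory
    for line in f:
        cols = line.rstrip('\n').split('\t')
        yield (int(cols[timestamp_pos]), cols[personID_pos], cols[x_pos], cols[y_pos])

infile = open('unsorted.txt', 'r')
infile.readline()                  # toss out the header
conn.executemany('INSERT INTO points VALUES (?, ?, ?, ?)', rows(infile))
conn.commit()
infile.close()

outfile = open('sorted.txt', 'w')
for ts, person, x, y in conn.execute('SELECT ts, person, x, y FROM points ORDER BY ts'):
    outfile.write('%d\t%s\t%s\t%s\n' % (ts, person, x, y))
outfile.close()
conn.close()

SQLite keeps the rows on disk and the ORDER BY does the sorting, so peak memory stays small regardless of file size.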

2 Answers

3

Here is a rough estimate of the memory needed, based on constants derived from your example. At a minimum you have to figure in the Python internal object overhead for each split line, plus the overhead for each string.

It estimates 9.1 GB to store the file in memory, assuming the following constants, which overestimate a bit since you're only keeping part of each line:

  • 1.5 GB file size
  • 31,164,015 total lines
  • each line split into a list with 4 pieces

Code:

import sys

def sizeof(lst):
    # size of the list object itself plus the size of each string it holds
    return sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)

GIG = 1024**3
file_size = 1.5 * GIG
lines = 31164015
num_cols = 4
avg_line_len = int(file_size / float(lines))

# build a representative split line: a 4-element list of average-length fields
val = 'a' * (avg_line_len / num_cols)
lst = [val] * num_cols

line_size = sizeof(lst)
print 'avg line size: %d bytes' % line_size
print 'approx. memory needed: %.1f GB' % ((line_size * lines) / float(GIG))

Returns:

avg line size: 312 bytes
approx. memory needed: 9.1 GB
samplebias
  • Very interesting, although I'm fairly confident this math isn't correct, even though the final answer (and point) that much more room is required is indeed correct. The reason I'm doubting is that I can call readlines() and it loads the entire file using around 1.5 GB, which seems to be the memory model you are going for (based on the def of sizeof), i.e. a bunch of line objects (strings) that each contain a number of characters – Hamy May 20 '11 at 05:13
  • Edit: it takes around 1.7 GB to load using readlines(), so it does seem that the culprit is the per-variable allocation overhead – Hamy May 20 '11 at 05:20
  • The 9.1GB estimate is a high-water mark assuming you use 100% of each line (you're doing some splitting and indexing individual fields, so I don't know the resulting field lengths). So if you use only 2/3 of each line on average that results in ~6GB used, which lines up with your 3GB at 50% observation. – samplebias May 20 '11 at 06:03
1

I don't know about the analysis of the memory usage, but you might try this to get it to work without running out of memory: sort into a new file that is accessed through a memory mapping (I've been led to believe this works efficiently, memory-wise). mmap has some OS-specific behaviour; I tested this on Linux (on a very small scale).

This is the basic code. To make it run with decent time efficiency you'd probably want to do a binary search on the sorted file to find where to insert each line; otherwise it will probably take a long time.

You can find a file-seeking binary search algorithm in this question; a rough sketch of the idea is also included after the code below.

Hopefully this is a memory-efficient way of sorting a massive file by line:

import os
from mmap import mmap

input_file = open('unsorted.txt', 'r')
output_file = open('sorted.txt', 'w+')

# need to provide something in order to be able to mmap the file
# so we'll just copy the first line over
output_file.write(input_file.readline())
output_file.flush()
mm = mmap(output_file.fileno(), os.stat(output_file.name).st_size)
cur_size = mm.size()

for line in input_file:
  mm.seek(0)
  tup = line.split("\t")
  # linear scan of the already-sorted output to find the insertion point
  while True:
    cur_loc = mm.tell()
    o_line = mm.readline()
    o_tup = o_line.split("\t")
    if o_line == '' or tup[0] < o_tup[0]: # EOF or we found our spot
      # grow the mapping, shift everything from cur_loc onwards to the right,
      # then drop the new line into the gap
      mm.resize(cur_size + len(line))
      mm[cur_loc+len(line):] = mm[cur_loc:cur_size]
      mm[cur_loc:cur_loc+len(line)] = line
      cur_size += len(line)
      break

mm.close()
input_file.close()
output_file.close()
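
Not the linked algorithm itself, just a rough, untested sketch of the kind of file-seeking binary search mentioned above (find_insert_offset is a made-up helper; it assumes every line ends with '\n' and compares the first tab-separated field as a string, just like the loop above):

def find_insert_offset(mm, key, size):
  # binary search over byte offsets in the sorted region [0, size) of mm,
  # returning the offset of the first line whose leading field is >= key
  lo, hi = 0, size
  while lo < hi:
    mid = (lo + hi) // 2
    mm.seek(mid)
    mm.readline()                  # discard the (possibly partial) line we landed in
    nxt = mm.readline()            # first complete line starting after mid
    if nxt == '' or nxt.split('\t')[0] >= key:
      hi = mid                     # the insertion point is at or before here
    else:
      lo = mid + 1                 # every line up to here is still < key
  # lo is now the start of the line just before the target,
  # unless the very first line is already >= key
  mm.seek(lo)
  first = mm.readline()
  if lo == 0 and first.split('\t')[0] >= key:
    return 0
  return mm.tell()

With something like this, the linear mm.seek(0) scan above could be replaced by cur_loc = find_insert_offset(mm, tup[0], cur_size), keeping the resize-and-shift step unchanged.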
Kyle