
I intend to read a file, which is about 500MB in total, into a dict according to the key in each line. The code snippet is as follows:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = [l.strip() for l in f2.readlines() if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines])    ## convert [(1,2), (3,4)] to {1:2, 3:4}

When running on a machine with 4GB of memory, Python raises a MemoryError. If I change the expression assigned to sample to [l for l in lines], it works fine.

At first, I thought it was the split method that was consuming lots of memory, so I adjusted my code to this:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

...

sample = dict([(l[find_nth(l, "\t", 4):].strip(), l) for l in lines])

But the result was the same.

A new discovery is that it runs normally without the MemoryError provided I remove the dict() conversion, regardless of the rest of the code logic.

Could anyone give me some idea on this problem?

Lalit Kumar B
Judking

2 Answers


You’re creating a list containing every line, which will continue to exist until `lines` goes out of scope, then creating another big list of entirely different strings based on it, and then a dict from that, all before anything can be freed. Just build the dict in one step.

with open("ENST-NM-chr-name.txt") as f:
    sample = {}

    for l in f:
        l = l.strip()

        if l:
            sample[l.split("\t")[2].strip('"')] = l

You can achieve about the same effect by using a generator expression instead of a list comprehension, but it feels nicer (to me) not to strip twice.
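For reference, the generator-expression version alluded to might look like this. This is a sketch using a tiny stand-in file, since the OP's 500MB ENST-NM-chr-name.txt isn't available; the sample rows are hypothetical:

```python
import os
import tempfile

# Tiny stand-in for the OP's 500MB ENST-NM-chr-name.txt (hypothetical rows).
path = os.path.join(tempfile.mkdtemp(), "ENST-NM-chr-name.txt")
with open(path, "w") as f:
    f.write('ENST0001\tNM_0001\t"chr1"\tgeneA\n')
    f.write('\n')  # blank line, filtered out below
    f.write('ENST0002\tNM_0002\t"chr2"\tgeneB\n')

# Generator expression: lines are stripped lazily, one at a time,
# and never stored in an intermediate list.
with open(path) as f:
    stripped = (l.strip() for l in f)
    sample = {l.split("\t")[2].strip('"'): l for l in stripped if l}

print(sorted(sample))  # → ['chr1', 'chr2']
```

Unlike the explicit loop above, this version only strips each line once, at the cost of routing everything through a second generator.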

Eric O. Lebigot
Ry-

What if you turn your list into a generator, and your dict into a lovely dictionary comprehension:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = (l.strip() for l in f2 if l.strip())
sample = {line.split('\t')[2].strip('\"'): line for line in lines}

Line 2 above originally read `lines = (l.strip() for l in f2.readlines() if l.strip())` by mistake.

Do a generator and a dict comprehension perhaps (somehow) alleviate the memory requirements?
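A quick way to see the difference in kind, without a 500MB file, is to compare the types involved: `readlines()` materializes a list holding every line at once, while a generator expression over the file object yields lines on demand. A minimal sketch, using an in-memory `StringIO` as a stand-in for the file:

```python
import io
import types

# Stand-in for a large file: a StringIO with a few tab-separated lines.
f = io.StringIO("a\tb\n\nc\td\n")

whole = f.readlines()          # a list: every line is in memory at once
assert isinstance(whole, list)

f.seek(0)
lazy = (l.strip() for l in f)  # a generator: lines produced on demand
assert isinstance(lazy, types.GeneratorType)

# Consuming the generator yields stripped, non-empty lines one by one.
nonempty = [l for l in lazy if l]
print(nonempty)  # → ['a\tb', 'c\td']
```

So the generator alone helps only if nothing downstream collects all the lines again; the dict of keys and values must still fit in memory either way.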

Eric O. Lebigot
supermitch
  • This does not solve the (likely) problem of reading the whole file into memory (which is an unnecessary waste, in any case). minitech's answer avoids this. – Eric O. Lebigot Apr 15 '15 at 03:56
  • 1
    Doesn't using a generator to read lines do exactly that? Since it doesn't build a list of lines in memory? – supermitch Apr 15 '15 at 03:58
  • You are not using a generator, with your `f2.readlines()`: this puts the whole file in memory (since it builds a *list* with all the lines). `readlines()` can be avoided, most of the time. Even if you avoided it here by doing `for l in f2`, (where f2 is an iterator) you would still be storing all the lines in your `lines`. Again, minitech's answer is the way to go (smallest memory footprint possible). – Eric O. Lebigot Apr 15 '15 at 04:07
  • @EOL: `lines = (l.strip() for l in f2 if l.strip())` would make it a lazily-evaluated generator. (Note parentheses instead of square brackets.) – Ry- Apr 15 '15 at 04:13
  • Oops, I meant to remove `readlines()`... and do `l in f2`, as per minitech's comment. – supermitch Apr 15 '15 at 04:13
  • @minitech Good point, I read too fast and had missed the parentheses instead of brackets. Python could be even more explicit than it is already, by forcing `iter()` to be used. :) – Eric O. Lebigot Apr 15 '15 at 07:23