
I intend to read a file, which is about 500MB in total, into a dict according to the key in each line. The code snippet is as follows:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = [l.strip() for l in f2.readlines() if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines])    ## convert [(1,2), (3,4)] to {1:2, 3:4}

When running on a machine with 4GB of memory, Python raises a MemoryError. If I change the expression assigned to sample to [l for l in lines], it works fine.

At first, I thought it was the split method that was consuming lots of memory, so I adjusted my code to this:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

...

sample = dict([(l[find_nth(l, "\t", 4):].strip(), l) for l in lines])

But the result was the same.

A new discovery is that it runs normally without the MemoryError provided I remove the dict() conversion, regardless of the rest of the code logic.

Could anyone give me some idea on this problem?

Lalit Kumar B
Judking

2 Answers


You’re creating a list containing every line, which will continue to exist until `lines` goes out of scope, then creating another big list of entirely different strings based on it, and then a dict from that, all before anything can be freed. Just build the dict in one step.

with open("ENST-NM-chr-name.txt") as f:
    sample = {}

    for l in f:
        l = l.strip()

        if l:
            sample[l.split("\t")[2].strip('"')] = l

You can achieve about the same effect by using a generator expression instead of a list comprehension, but it feels nicer (to me) not to strip twice.
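For reference, the generator-expression version alluded to might look like this. This is a sketch using a tiny stand-in file, since the OP's 500MB ENST-NM-chr-name.txt isn't available; the sample rows are hypothetical:

```python
import os
import tempfile

# Tiny stand-in for the OP's 500MB ENST-NM-chr-name.txt (hypothetical rows).
path = os.path.join(tempfile.mkdtemp(), "ENST-NM-chr-name.txt")
with open(path, "w") as f:
    f.write('ENST0001\tNM_0001\t"chr1"\tgeneA\n')
    f.write('\n')  # blank line, filtered out below
    f.write('ENST0002\tNM_0002\t"chr2"\tgeneB\n')

# Generator expression: lines are stripped lazily, one at a time,
# and never stored in an intermediate list.
with open(path) as f:
    stripped = (l.strip() for l in f)
    sample = {l.split("\t")[2].strip('"'): l for l in stripped if l}

print(sorted(sample))  # → ['chr1', 'chr2']
```

Unlike the explicit loop above, this version only strips each line once, at the cost of routing everything through a second generator.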

Eric O. Lebigot
Ry-

What if you turn your list into a generator, and your dict into a lovely dictionary comprehension:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = (l.strip() for l in f2 if l.strip())
sample = {line.split('\t')[2].strip('\"'): line for line in lines}

Line 2 above originally read `lines = (l.strip() for l in f2.readlines() if l.strip())` by mistake.

Do a generator and a dict comprehension perhaps (somehow) alleviate the memory requirements?
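A quick way to see the difference in kind, without a 500MB file, is to compare the types involved: `readlines()` materializes a list holding every line at once, while a generator expression over the file object yields lines on demand. A minimal sketch, using an in-memory `StringIO` as a stand-in for the file:

```python
import io
import types

# Stand-in for a large file: a StringIO with a few tab-separated lines.
f = io.StringIO("a\tb\n\nc\td\n")

whole = f.readlines()          # a list: every line is in memory at once
assert isinstance(whole, list)

f.seek(0)
lazy = (l.strip() for l in f)  # a generator: lines produced on demand
assert isinstance(lazy, types.GeneratorType)

# Consuming the generator yields stripped, non-empty lines one by one.
nonempty = [l for l in lazy if l]
print(nonempty)  # → ['a\tb', 'c\td']
```

So the generator alone helps only if nothing downstream collects all the lines again; the dict of keys and values must still fit in memory either way.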

Eric O. Lebigot
supermitch
  • This does not solve the (likely) problem of reading the whole file into memory (which is an unnecessary waste, in any case). minitech's answer avoids this. – Eric O. Lebigot Apr 15 '15 at 03:56
  • 1
    Doesn't using a generator to read lines do exactly that? Since it doesn't build a list of lines in memory? – supermitch Apr 15 '15 at 03:58
  • You are not using a generator, with your `f2.readlines()`: this puts the whole file in memory (since it builds a *list* with all the lines). `readlines()` can be avoided, most of the time. Even if you avoided it here by doing `for l in f2`, (where f2 is an iterator) you would still be storing all the lines in your `lines`. Again, minitech's answer is the way to go (smallest memory footprint possible). – Eric O. Lebigot Apr 15 '15 at 04:07
  • @EOL: `lines = (l.strip() for l in f2 if l.strip())` would make it a lazily-evaluated generator. (Note parentheses instead of square brackets.) – Ry- Apr 15 '15 at 04:13
  • Oops, I meant to remove `readlines()`... and do `l in f2`, as per minitech's comment. – supermitch Apr 15 '15 at 04:13
  • @minitech Good point, I read too fast and had missed the parentheses instead of brackets. Python could be even more explicit than it is already, by forcing `iter()` to be used. :) – Eric O. Lebigot Apr 15 '15 at 07:23