2

I have a "not so" large file (~2.2GB) which I am trying to read and process...

import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt","w")
print "Reading file"
with open("final_edge_list.txt","r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens)==3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                #tup1 = (destination,weight)
                #tup2 = (src,weight)
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line 
                error.write(line+"\n")
        except Exception, e:
            string = str(Exception) + " " + str(e) +"==> "+ line +"\n"
            error.write(string)
            continue

Am I doing something wrong?

It's been about an hour since the code started reading the file (it's still reading).

Memory usage is already at 20 GB. Why is it taking so much time and memory?

frazman
  • Oh, well, at least you're not having a 50G memory leak like the one I had a while ago :D That said, ever looked at graph manipulation libraries such as [NetworkX](http://networkx.github.io/)? They're probably more efficient! – F.X. Nov 05 '13 at 18:48
  • Comment out the dict-building code and see how long it takes to read the file. My guess is that it will run quickly then. My other guess is the same as @DSM's: you're probably creating an enormous number of dicts. – Tim Peters Nov 05 '13 at 18:54
  • I'm not confident enough to post this as an answer, but shouldn't you use f.readlines() first? – Dunno Nov 05 '13 at 18:56
  • @Dunno: No. `readlines()` will make the memory issue worse: it will read the entire file into memory before the loop starts, whereas `for line in f:` will put just single lines into memory. – bukzor Nov 05 '13 at 18:58
  • @bukzor: I just thought `for line in f:` won't work properly without using `readlines()` first. Anyway, thanks and never mind. – Dunno Nov 05 '13 at 19:00
  • Extract from the Python Manual - For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code: `>>> for line in f: ... print(line, end='')` – shad0w_wa1k3r Nov 05 '13 at 19:01
  • What Python interpreter / version do you use? – moooeeeep Nov 05 '13 at 19:05
  • @moooeeeep: I am using python 2.6 (the one that comes with redhat) – frazman Nov 05 '13 at 19:07
  • Did you ever figure out the problem? – martineau Nov 12 '13 at 22:35

4 Answers

3

To get a rough idea of where the memory is going, you can use the `gc.get_objects` function. Wrap your above code in a `make_graph()` function (this is best practice anyway), and then wrap the call to this function with a `KeyboardInterrupt` exception handler which writes the gc data out to a file.

def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    # Pick the first unused gc.log.<i> filename
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    # Dump every object the garbage collector is currently tracking
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))


if __name__ == '__main__':
    main()

Now whenever you ctrl+c your program, you'll get a new gc.log. Given a few samples you should be able to see the memory issue.
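
If the raw dump is too large to read through, a complementary trick is to tally the tracked objects by type instead of dumping them all. The sketch below is one way to do that (the `write_gc_summary` name and the `gc.summary.log` filename are just illustrative, not from the original answer); an exploding count of `dict` objects would point straight at the nested graph structure.

def write_gc_summary():
    from gc import get_objects
    # Count gc-tracked objects by type name
    counts = {}
    for obj in get_objects():
        name = type(obj).__name__
        counts[name] = counts.get(name, 0) + 1
    # Write the 20 most common types, one per line
    with open('gc.summary.log', 'w') as f:
        for name, count in sorted(counts.items(), key=lambda item: -item[1])[:20]:
            f.write('%s\t%d\n' % (name, count))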

bukzor
2

Python's numeric types use quite a lot of memory compared to those of other programming languages. On my setup it appears to be 24 bytes for each number:

>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24

Given that there are on the order of a hundred million lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.
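
For a rough back-of-envelope estimate (mine, not part of the original answer; the byte counts vary between builds, and the edge/node counts below are purely hypothetical), you can measure the fixed costs directly and multiply them out:

import sys

# Per-object costs on this interpreter (64-bit CPython 2.x)
print "empty dict: %d bytes" % sys.getsizeof({})
print "long:       %d bytes" % sys.getsizeof(12345L)
print "float:      %d bytes" % sys.getsizeof(0.5)

# Very rough lower bound for N edges over M nodes: each edge creates two
# longs and a float plus two hash-table entries (~3 machine words each);
# each node gets its own inner dict.  Dict resizing is ignored here.
N, M = 100 * 10**6, 10 * 10**6   # hypothetical edge and node counts
per_edge = 2 * sys.getsizeof(12345L) + sys.getsizeof(0.5) + 2 * 24
per_node = sys.getsizeof({})
print "lower bound: ~%.1f GB" % ((N * per_edge + M * per_node) / 1e9)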

On top of that, some versions of the Python interpreter (including CPython 2.6) are known to keep so-called free lists for allocation performance, especially for objects of type `int` and `float`. Once allocated, this memory is not returned to the operating system until your process terminates. Also have a look at the question I posted when I first discovered this issue.

Suggestions to work around this include:

  • use a subprocess to do the memory-hungry computation, e.g., based on the multiprocessing module (see the sketch after this list)
  • use a library that implements the functionality in C, e.g., numpy, pandas
  • use another interpreter, e.g., PyPy
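
A minimal sketch of the subprocess idea using the multiprocessing module (the `count_edges` worker below is a stand-in for whatever memory-hungry pass you actually need, not code from the original answer); the point is that whatever the free lists hold on to is released when the child process exits:

from multiprocessing import Process, Queue

def count_edges(path, result_queue):
    # All the memory-hungry work happens in this child process.
    edges = 0
    total_weight = 0.0
    with open(path) as f:
        for line in f:
            tokens = line.split("\t")
            if len(tokens) == 3:
                edges += 1
                total_weight += float(tokens[2])
    result_queue.put((edges, total_weight))  # only a tiny result crosses the process boundary

if __name__ == '__main__':
    q = Queue()
    p = Process(target=count_edges, args=("final_edge_list.txt", q))
    p.start()
    edges, total_weight = q.get()  # fetch the result before join()
    p.join()
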
moooeeeep
2

There are a few things you can do:

  1. Run your code on a subset of the data, measure the time required, and extrapolate to the full size of your data. That will give you an estimate of how long it will run.

    counter = 0
    with open("final_edge_list.txt","r") as f:
        for line in f:
            counter += 1
            if counter == 200000:
                break
            try:
                ...

    On 1M lines it runs in ~8 sec on my machine, so a 2.2 GB file with about 100M lines should take ~15 min. However, once you run out of available memory, that estimate no longer holds.

  2. Your graph seems to be symmetric:

    graph[src][destination] = weight
    graph[destination][src] = weight
    

    In your graph processing code, exploit the symmetry of the graph and store each edge only once; that cuts memory usage roughly in half (see the sketch after this list).

  3. Run profilers on your code using a subset of the data and see what happens there. The simplest would be to run

    python -m cProfile --sort cumulative yourprogram.py
    

    There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
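
For point 2, here is a minimal sketch of storing each undirected edge only once by normalizing the key order (my illustration, not code from the original answer):

from collections import defaultdict

graph = defaultdict(dict)

def add_edge(src, dst, weight):
    # Keep one canonical direction so each undirected edge is stored once.
    a, b = (src, dst) if src <= dst else (dst, src)
    graph[a][b] = weight

def get_weight(src, dst):
    a, b = (src, dst) if src <= dst else (dst, src)
    return graph.get(a, {}).get(b)  # .get() avoids creating empty inner dicts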

Dmitriy
2
  • You don't need `graph` to be a `defaultdict(dict)`; a plain `dict` will do: `graph[src, destination] = weight` and `graph[destination, src] = weight`, or even only one of them.
  • To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and can be compressed (see the sketch below).
  • What do you plan to do with your nodes list afterwards?
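
For the scipy.sparse suggestion, a rough sketch of building a symmetric sparse adjacency matrix from the edge list (my illustration, not part of the original answer). It assumes each undirected edge appears once in the file and that the node ids are, or have been remapped to, reasonably small integer indices; for a file this size you would probably accumulate into array.array or numpy buffers rather than Python lists:

import numpy as np
from scipy.sparse import coo_matrix

rows, cols, weights = [], [], []
with open("final_edge_list.txt") as f:
    for line in f:
        tokens = line.split("\t")
        if len(tokens) == 3:
            rows.append(int(tokens[0]))
            cols.append(int(tokens[1]))
            weights.append(float(tokens[2]))

n = max(max(rows), max(cols)) + 1
# Store each edge in one direction, then symmetrize; CSR gives cheap row access.
adj = coo_matrix((weights, (rows, cols)), shape=(n, n), dtype=np.float64)
adj = (adj + adj.T).tocsr()
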
latheiere