2

I have a "not so" large file (~2.2GB) which I am trying to read and process...

import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt","w")
print "Reading file"
with open("final_edge_list.txt","r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens)==3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                #tup1 = (destination,weight)
                #tup2 = (src,weight)
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line 
                error.write(line+"\n")
        except Exception, e:
            string = str(Exception) + " " + str(e) +"==> "+ line +"\n"
            error.write(string)
            continue

Am I doing something wrong?

It's been about an hour since the code started reading the file (it's still reading).

Memory usage is already at 20 GB. Why is it taking so much time and memory?

frazman
  • Oh, well, at least you're not having a 50G memory leak like the one I had a while ago :D That said, ever looked at graph manipulation libraries such as [NetworkX](http://networkx.github.io/)? They're probably more efficient! – F.X. Nov 05 '13 at 18:48
  • Comment out the dict-building code and see how long it takes to read the file. My guess is that it will run quickly then. My other guess is the same as @DSM's: you're probably creating an enormous number of dicts. – Tim Peters Nov 05 '13 at 18:54
  • I'm not confident enough to post this as an answer, but shouldn't you use f.readlines() first? – Dunno Nov 05 '13 at 18:56
  • @Dunno: No. `readlines()` will make the memory issue worse: it will read the entire file into memory before the loop starts, whereas `for line in f:` will put just single lines into memory. – bukzor Nov 05 '13 at 18:58
  • @bukzor: I just thought `for line in f:` won't work properly without using `readlines()` first. Anyway, thanks and never mind. – Dunno Nov 05 '13 at 19:00
  • Extract from the Python Manual - For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code: `>>> for line in f: ... print(line, end='')` – shad0w_wa1k3r Nov 05 '13 at 19:01
  • What Python interpreter / version do you use? – moooeeeep Nov 05 '13 at 19:05
  • @moooeeeep: I am using python 2.6 (the one that comes with redhat) – frazman Nov 05 '13 at 19:07
  • Did you ever figure out the problem? – martineau Nov 12 '13 at 22:35

4 Answers

3

To get a rough idea of where the memory is going, you can use the `gc.get_objects` function. Wrap your above code in a `make_graph()` function (this is best practice anyway), and then wrap the call to this function with a `KeyboardInterrupt` exception handler which writes the gc data out to a file.

def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    # Pick the first unused gc.log.<i> filename
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    # Dump every object the garbage collector is currently tracking
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))


if __name__ == '__main__':
    main()

Now whenever you ctrl+c your program, you'll get a new gc.log. Given a few samples you should be able to see the memory issue.
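
If the raw dump is too large to read through, a complementary trick is to tally the tracked objects by type instead of dumping them all. The sketch below is one way to do that (the `write_gc_summary` name and the `gc.summary.log` filename are just illustrative, not from the original answer); an exploding count of `dict` objects would point straight at the nested graph structure.

def write_gc_summary():
    from gc import get_objects
    # Count gc-tracked objects by type name
    counts = {}
    for obj in get_objects():
        name = type(obj).__name__
        counts[name] = counts.get(name, 0) + 1
    # Write the 20 most common types, one per line
    with open('gc.summary.log', 'w') as f:
        for name, count in sorted(counts.items(), key=lambda item: -item[1])[:20]:
            f.write('%s\t%d\n' % (name, count))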

bukzor
2

Python's numeric types use quite a lot of memory compared to those of other programming languages. On my setup it appears to be 24 bytes for each number:

>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24

Given that there are on the order of a hundred million lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.
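
For a rough back-of-envelope estimate (mine, not part of the original answer; the byte counts vary between builds, and the edge/node counts below are purely hypothetical), you can measure the fixed costs directly and multiply them out:

import sys

# Per-object costs on this interpreter (64-bit CPython 2.x)
print "empty dict: %d bytes" % sys.getsizeof({})
print "long:       %d bytes" % sys.getsizeof(12345L)
print "float:      %d bytes" % sys.getsizeof(0.5)

# Very rough lower bound for N edges over M nodes: each edge creates two
# longs and a float plus two hash-table entries (~3 machine words each);
# each node gets its own inner dict.  Dict resizing is ignored here.
N, M = 100 * 10**6, 10 * 10**6   # hypothetical edge and node counts
per_edge = 2 * sys.getsizeof(12345L) + sys.getsizeof(0.5) + 2 * 24
per_node = sys.getsizeof({})
print "lower bound: ~%.1f GB" % ((N * per_edge + M * per_node) / 1e9)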

On top of that, some versions of the Python interpreter (including CPython 2.6) are known to keep so-called free lists for allocation performance, especially for objects of type `int` and `float`. Once allocated, this memory is not returned to the operating system until your process terminates. Also have a look at the question I posted when I first discovered this issue.

Suggestions to work around this include:

  • use a subprocess to do the memory-hungry computation, e.g., based on the multiprocessing module (see the sketch after this list)
  • use a library that implements the functionality in C, e.g., numpy, pandas
  • use another interpreter, e.g., PyPy
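
A minimal sketch of the subprocess idea using the multiprocessing module (the `count_edges` worker below is a stand-in for whatever memory-hungry pass you actually need, not code from the original answer); the point is that whatever the free lists hold on to is released when the child process exits:

from multiprocessing import Process, Queue

def count_edges(path, result_queue):
    # All the memory-hungry work happens in this child process.
    edges = 0
    total_weight = 0.0
    with open(path) as f:
        for line in f:
            tokens = line.split("\t")
            if len(tokens) == 3:
                edges += 1
                total_weight += float(tokens[2])
    result_queue.put((edges, total_weight))  # only a tiny result crosses the process boundary

if __name__ == '__main__':
    q = Queue()
    p = Process(target=count_edges, args=("final_edge_list.txt", q))
    p.start()
    edges, total_weight = q.get()  # fetch the result before join()
    p.join()
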
moooeeeep
2

There are a few things you can do:

  1. Run your code on a subset of the data, measure the time required, and extrapolate to the full size of your data. That will give you an estimate of how long it will run.

    counter = 0
    with open("final_edge_list.txt","r") as f:
        for line in f:
            counter += 1
            if counter == 200000:
                break
            try:
                ...

    On 1M lines it runs in ~8 sec on my machine, so a 2.2 GB file with about 100M lines should take ~15 min. However, once you run out of available memory, that estimate no longer holds.

  2. Your graph seems to be symmetric:

    graph[src][destination] = weight
    graph[destination][src] = weight
    

    In your graph processing code, exploit the symmetry of the graph and store each edge only once; that cuts memory usage roughly in half (see the sketch after this list).

  3. Run profilers on your code using a subset of the data and see what happens there. The simplest would be to run

    python -m cProfile --sort cumulative yourprogram.py
    

    There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
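
For point 2, here is a minimal sketch of storing each undirected edge only once by normalizing the key order (my illustration, not code from the original answer):

from collections import defaultdict

graph = defaultdict(dict)

def add_edge(src, dst, weight):
    # Keep one canonical direction so each undirected edge is stored once.
    a, b = (src, dst) if src <= dst else (dst, src)
    graph[a][b] = weight

def get_weight(src, dst):
    a, b = (src, dst) if src <= dst else (dst, src)
    return graph.get(a, {}).get(b)  # .get() avoids creating empty inner dicts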

Dmitriy
2
  • You don't need `graph` to be a `defaultdict(dict)`; a plain `dict` will do: `graph[src, destination] = weight` and `graph[destination, src] = weight`, or even only one of them.
  • To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and can be compressed (see the sketch below).
  • What do you plan to do with your nodes list afterwards?
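
For the scipy.sparse suggestion, a rough sketch of building a symmetric sparse adjacency matrix from the edge list (my illustration, not part of the original answer). It assumes each undirected edge appears once in the file and that the node ids are, or have been remapped to, reasonably small integer indices; for a file this size you would probably accumulate into array.array or numpy buffers rather than Python lists:

import numpy as np
from scipy.sparse import coo_matrix

rows, cols, weights = [], [], []
with open("final_edge_list.txt") as f:
    for line in f:
        tokens = line.split("\t")
        if len(tokens) == 3:
            rows.append(int(tokens[0]))
            cols.append(int(tokens[1]))
            weights.append(float(tokens[2]))

n = max(max(rows), max(cols)) + 1
# Store each edge in one direction, then symmetrize; CSR gives cheap row access.
adj = coo_matrix((weights, (rows, cols)), shape=(n, n), dtype=np.float64)
adj = (adj + adj.T).tocsr()
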
latheiere