
I'm writing an application in Python 2.7 that runs an MST algorithm over a huge graph (around 100–150 million edges). The graph is set up as an adjacency list using a classic class with methods like:

def insertArcW(self, tail, head, weight):
    # append head and weight as two consecutive entries of tail's adjacency list
    if head in self.nodes and tail in self.nodes:
        self.adj[tail].addAsLast(head)
        self.adj[tail].addAsLast(weight)

def insertNode(self, e):
    newnode = Node(self.nextId, e)
    self.nextId += 1
    return newnode  # returned so the caller can read its .index later
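
For context, self.nodes, self.adj and self.nextId live in the surrounding Graph class, together with the Node class and my array-based linked list. Roughly, the layout is assumed to be like this (only a sketch, not the exact code; insertNode is also assumed to register the new node in self.nodes and to give it an empty adjacency list):

class Node:
    def __init__(self, index, value):
        self.index = index   # numeric id, used to address the adjacency structure
        self.value = value

class Graph:
    def __init__(self):
        self.nodes = {}      # node index -> Node
        self.adj = {}        # node index -> linked list of alternating head/weight entries
        self.nextId = 0      # id handed out to the next inserted node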

I'm also using a linked list of my own (built on top of an array) and the Queue module from the Python 2.7 standard library. With this piece of code the node insertion is really fast (there are far fewer nodes than edges):

n = []
for i in xrange(int(file_List[0])):   # file_List[0] holds the number of nodes
    n.append(G.insertNode(i))

The problem comes when inserting the edges:

for e in xrange(len(arc_List)):
    # insert each edge in both directions, since the graph is undirected
    G.insertArcW(n[arc_List[e][0]].index, n[arc_List[e][1]].index, arc_List[e][2])
    G.insertArcW(n[arc_List[e][1]].index, n[arc_List[e][0]].index, arc_List[e][2])

It works great with 1 million edges, but with more it eats all of my RAM (4 GB, 64-bit) without freezing: it can still build the graph, just very slowly. Since CPU usage stays around 19–25% while this runs, is there a way to do this with multiprocessing or multithreading? For example, building the graph on two cores doing the same operation at the same time but on different data, i.e. one core working on half of the edges and the other core on the other half. I'm practically new to this area of programming, especially in Python.
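
Roughly, this is the kind of split I have in mind (only a sketch, not my real code: parse_chunk, the two-chunk split and the one-edge-per-line 'graph.txt' format are made up here, and the actual insertArcW calls would still happen in a single process, because the Graph object lives only in the parent):

import multiprocessing

def parse_chunk(lines):
    # turn raw "tail head" lines into (tail, head) integer pairs
    edges = []
    for line in lines:
        tail, head = line.split()
        edges.append((int(tail), int(head)))
    return edges

if __name__ == '__main__':
    with open('graph.txt') as f:            # placeholder file name
        lines = f.readlines()[1:]           # skip an assumed header line
    half = len(lines) // 2
    chunks = [lines[:half], lines[half:]]   # one half of the edges per worker
    pool = multiprocessing.Pool(processes=2)
    try:
        for edges in pool.map(parse_chunk, chunks):
            for tail, head in edges:
                pass                        # G.insertArcW(...) would go here, in the parent
    finally:
        pool.close()
        pool.join()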

EDIT: Below is the body of the function where I set up the two lists, one for nodes and one for edges; the information comes from a ".txt" file. Calling insertArcW and insertNode inside it, RAM usage oscillates between 2.4 GB and 2.6 GB. Now I can say it is stable (maybe because the two huge lists of edges and nodes get released), but it still runs at the same speed. Code:

import re
import random

f = open(graph + '.txt', 'r')              # 'graph' holds the file name without extension
v = f.read()
file_List = re.split('\s+', v)             # file_List[0] = node count, file_List[1] = edge count
arc_List = []
n = []
p = []
for x in xrange(0, int(file_List[1])):
    arc_List.append([0, 0, 0])             # pre-allocate [tail, head, weight] triples
for i in xrange(int(file_List[0])):
    n.append(G.insertNode(i))
for weight in xrange(1, int(file_List[1]) + 1):
    p.append(weight)
random.shuffle(p)                          # distinct random weights, one per edge
i = 0
r = 0
while r < int(file_List[1]):
    for k in xrange(2, len(file_List), 2): # edge endpoints start at file_List[2], as pairs
        arc_List[r][0] = int(file_List[k])
        arc_List[r][1] = int(file_List[k + 1])
        arc_List[r][2] = float(p[i])
        G.insertArcW(n[arc_List[r][0]].index, n[arc_List[r][1]].index, arc_List[r][2])
        G.insertArcW(n[arc_List[r][1]].index, n[arc_List[r][0]].index, arc_List[r][2])
        print r
        i += 1
        r += 1

f.close()
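
To make the idea of dropping the two huge lists more concrete, this is the direction I'm thinking of: walk the file token by token instead of building file_List and arc_List first (only a sketch; tokens() and load_graph() are names I made up, and G is the same Graph object as above):

import random

def tokens(path):
    # yield whitespace-separated tokens without loading the whole file into memory
    with open(path, 'r') as f:
        for line in f:
            for tok in line.split():
                yield tok

def load_graph(G, graph):
    it = tokens(graph + '.txt')
    num_nodes = int(next(it))
    num_edges = int(next(it))
    n = [G.insertNode(i) for i in xrange(num_nodes)]
    p = range(1, num_edges + 1)    # distinct weights 1..num_edges
    random.shuffle(p)
    for r in xrange(num_edges):
        tail = int(next(it))
        head = int(next(it))
        w = float(p[r])
        G.insertArcW(n[tail].index, n[head].index, w)
        G.insertArcW(n[head].index, n[tail].index, w)
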
  • The fastest and often the cheapest way to speed up anything like this is to invest in some more RAM, if that is at all possible. 4GB on a 64-bit machine is near the bottom limit. – Steve Barnes Aug 16 '15 at 12:37
  • @SteveBarnes I know for sure that this kind of "project" works a bit better with 8 GB / 16 GB of RAM, but I want to try to solve the problem in the way I described. At the moment not all of my resources are used (4-core CPU, only one core "used"). I'd really love to make something helpful for every target, including the cheaper ones (2 or 4 cores with little RAM). I mean: why upgrade the hardware if I can do better with knowledge of the architecture (whenever that's possible)? – Roberto Pavia Aug 16 '15 at 12:49
  • Keep in mind that there is some memory overhead to multi-processing so if you are short on RAM things can actually get worse when you switch to multi-processing. – Steve Barnes Aug 16 '15 at 12:51
  • Unrelated: there is no Linked List in stdlib. Python list uses an array as an implementation. – jfs Aug 16 '15 at 12:52
  • Unfortunately, you can't easily parallelize the computations without increasing [memory usage](http://stackoverflow.com/q/1268252/4279) in this case (hard to reduce to using numpy arrays, other ctypes-compatible types). Python doesn't look like the right tool for the job here. – jfs Aug 16 '15 at 12:57
  • @SteveBarnes OK. What about threading? It has shared memory, if I'm not wrong. Maybe there is a chance that way? Two threads sharing the same memory, one starting work from the beginning and the other one from the middle. Unrelated: sorry about that, I'm going to edit my post; the Record class for the Linked_List is built with an array. – Roberto Pavia Aug 16 '15 at 13:00
  • @J.F.Sebastian Sorry, I did not understand: I can't parallelize the computations? (Adding info: I can't use non-standard Python libraries, only the Python 2.7 standard library.) – Roberto Pavia Aug 16 '15 at 13:07
  • Python has the [GIL](http://stackoverflow.com/questions/1294382/what-is-a-global-interpreter-lock-gil), which means threading any CPU-intensive tasks won't boost performance (unless you use something like numpy). – Colonel Thirty Two Aug 16 '15 at 13:10
  • Basically, whatever you use, if the data will not all fit into the available RAM you will be constrained by disk IO speed as the bottleneck rather than processor speed/number of cores. You could probably structure your data so that all of it is stored on disk in the order it is needed, which would speed things up more than Windows swap would. Note that partitioning the data is also needed for multiprocessing/threading. – Steve Barnes Aug 16 '15 at 13:19
  • @RobertoPavia: yes, I meant "**can't** easily parallelize" *in this case*. – jfs Aug 16 '15 at 13:19
  • @ColonelThirtyTwo Thumbs up, thank you for the GIL information; that adds something to my Python background. "Use something like numpy"... how would I do this with numpy? – Roberto Pavia Aug 16 '15 at 13:27
  • @SteveBarnes Update: I edited my code so that insertArcW and insertNode are called inside the function that builds the list of edges from the txt file. Is this related to partitioning the data? Every for loop uses xrange(). This way my RAM usage is fairly good: it oscillates between 2.4 GB and 2.6 GB. Going to edit my post with this news. – Roberto Pavia Aug 16 '15 at 13:32
  • check out [PyParallel (cpython fork)](https://github.com/pyparallel/pyparallel) that might be suitable if you can adapt your use-case to the *"don't persist parallel objects"* model (pyparallel is based on Python 3.3 and works only on Windows). – jfs Sep 10 '15 at 19:52

0 Answers