
Hey, I've got a working script, but it works on k-combinations, so it takes a long time to run... I want to parallelize a for loop to cut down the running time.

Here is the simplified code:

import linecache

# 'intersection' and 'merge' are defined elsewhere in the full script
fin2 = open('combi_nod.txt', 'r')
for line in fin2:
    (i, j) = eval(line)  # each line holds one "(i, j)" tuple
    edgefile = open('edge.adjlist', 'a')
    count = 0
    for element in intersection(
            eval(linecache.getline('triangleset.txt', i + 1)),
            eval(linecache.getline('triangleset.txt', j + 1))):
        if element not in merge:
            count = 1
            break
    if count == 0:
        edgefile.write(' ' + str(j))
    edgefile.close()
fin2.close()

How can I do this?
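To make the goal concrete, this is roughly the shape I have in mind (just a sketch, not tested code: `check_pair` is a made-up name, `intersection` and `merge` come from my full script, and the worker returns its result so that only the parent process writes to `edge.adjlist`):

import linecache
from multiprocessing import Pool

def check_pair(pair):
    # 'intersection' and 'merge' are defined in the full script
    (i, j) = pair
    for element in intersection(
            eval(linecache.getline('triangleset.txt', i + 1)),
            eval(linecache.getline('triangleset.txt', j + 1))):
        if element not in merge:
            return None  # reject this pair
    return (i, j)        # keep this pair

if __name__ == '__main__':
    pairs = [eval(line) for line in open('combi_nod.txt')]  # whole file in memory
    pool = Pool()  # defaults to one worker process per core
    with open('edge.adjlist', 'a') as edgefile:
        for pair in pool.map(check_pair, pairs):
            if pair is not None:
                edgefile.write(' ' + str(pair[1]))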

EDIT

After some modifications I have got the multiprocessing loop working, but there is an associated issue:

In my initial for loop I read from the combi_nod.txt file; its content is the output of itertools.combinations over a large range (so, past a certain point, I can no longer store it all in a variable).

My multiprocessing loop works on a list built from these itertools.combinations, because I haven't found a way to pass the lines of a file as arguments (so I have a memory issue). Do you have any new ideas?
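What I'd like, if it exists, is something like `pool.imap`, which from the docs seems to consume its input lazily, so perhaps the open file itself can be the iterable and the full list never has to exist. A sketch of what I mean, with a hypothetical worker that parses its own line:

import ast
from multiprocessing import Pool

def worker(line):
    # hypothetical worker: each line of combi_nod.txt is "(i, j)" text
    (i, j) = ast.literal_eval(line)
    return (i, j)  # the real check would go here

if __name__ == '__main__':
    pool = Pool()
    fin2 = open('combi_nod.txt', 'r')
    # the file is read lazily, chunksize lines at a time, so memory stays flat
    for result in pool.imap(worker, fin2, chunksize=10000):
        pass  # handle each result as it arrives
    fin2.close()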

EDIT2

For clarification, here is the code as it stands at this point:

def intersterer(lines):
    (i, j) = lines
    counttt = 0
    for element in some_stuff:  # placeholder for the real intersection test
        if element not in merge:
            counttt = 1
            break
    if counttt == 0:
        return (int(i), int(j))
    else:
        return (0, 0)

from multiprocessing import Pool
import itertools

# dump every combination to disk, one "(i, j)" tuple per line
fin2 = open('combi_nod.txt', 'w')
for trian_c in itertools.combinations(xrange(0, counter_tri), 2):
    # counter_tri is a large number
    fin2.write(str(trian_c) + "\n")
fin2.close()
fin2 = open('combi_nod.txt', 'r')  # reopened for the attempt below

if __name__ == '__main__':
    pool = Pool()
    listt = pool.map(intersterer, itertools.combinations(xrange(0, counter_tri), 2))
    f2(listt)  # f2 is defined elsewhere in the full script
    if (0, 0) in listt: listt.remove((0, 0))  # note: removes only the first (0, 0)

and I want something that works like:

listt = pool.map(intersterer, fin2) 

But none of my tests work at all... Help...
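One thing I notice rereading my own code: iterating over fin2 yields strings like "(0, 1)\n", while intersterer starts with (i, j) = lines, which can only unpack a length-2 sequence, so every call should fail with a ValueError. A sketch of the worker adapted to parse its line first (ast.literal_eval is a safer stand-in for eval on plain tuples; some_stuff and merge are placeholders from the code above):

import ast

def intersterer(line):
    (i, j) = ast.literal_eval(line)  # turn the "(i, j)" text into a tuple
    for element in some_stuff:       # placeholder from the code above
        if element not in merge:
            return (0, 0)
    return (int(i), int(j))

With that change, pool.map(intersterer, fin2) should at least receive usable arguments, though pool.map still builds the entire result list in memory; pool.imap(intersterer, fin2, chunksize=10000) would keep it flat.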

Alex Jud
  • It's hard to tell from the simplified code, but where is the process spending most of its time, computing the `eval()`s or reading/writing the files? – martineau Jan 14 '15 at 20:58
  • All my variables are written to files to minimize RAM usage. My script generates a lot of data, so I can't store all of it in memory. – Alex Jud Jan 14 '15 at 21:23
  • The only part I want to parallelize is the first for loop. I'm thinking about processing eight lines at once. – Alex Jud Jan 14 '15 at 21:27
  • That statement's a bit confusing since the second for loop is nested inside the first. Writing everything to files may complicate the matter if you need simultaneous write access to the same one from concurrent processes or threads. In concurrent programming access to any shared resource has to be controlled by locks or semaphores or something similar. – martineau Jan 14 '15 at 23:09
  • I can split the initial file and merge the pieces in a final step. Thanks for telling me about this upcoming issue. Like I said, it's the first loop that I would like to parallelize. – Alex Jud Jan 15 '15 at 07:03
  • Being able to write to separate output files will be very helpful. I'm now trying to determine if `linecache` is thread-safe (due to its internal cache). Also, can you describe (in your question) what kind of expressions the `eval()` calls are processing? It's necessary to know what they're doing in order to determine if they can safely run in parallel. – martineau Jan 15 '15 at 09:43
  • It evals a sympy geometric expression. – Alex Jud Jan 15 '15 at 09:52
  • OK, again, good. Bad news is `linecache` doesn't look thread-safe. Could you use `mmap`, which would make the lines look like a huge array in memory? See [this](http://stackoverflow.com/questions/2787276/python-huge-file-reading-by-using-linecache-vs-normal-file-access-open) question. In fact it may be that you don't need to use parallel processing at all, because the issue may be file I/O. – martineau Jan 15 '15 at 09:58
  • Or I can just duplicate the source file. `mmap` in my experience just blows up my memory. In my computer stats I see that only one core is in use, and it is at full load. – Alex Jud Jan 15 '15 at 10:16
  • If you have and want to use multiple cores, the `multiprocessing` module is the way to go. You can define a `Pool` of worker processes and pass each one parameters that tell it which files to read and write. As far as the `eval()`s go, `threading` really doesn't run Python code in parallel due to the GIL (Global Interpreter Lock), which isn't released to let another thread run until the current one waits for I/O. – martineau Jan 15 '15 at 11:36
  • I suggest you [profile your code](http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script/7693928#7693928) with the Python profiler and see what's delaying things the most. – martineau Jan 15 '15 at 11:40
  • I'll try profiling soon. OK for the multiprocessing module, but I'm not sure how I can do it. Can you help me with it? – Alex Jud Jan 15 '15 at 11:51
  • Basically I'd start by making your current code a function that has the files it uses passed to it as arguments instead of using fixed strings as it does now. Initialization code would need to be written that determines which set of separate files to pass each process to work on, stored in a list of filename argument tuples, i.e. `[(f1a,f2a,f3a), (f1b,f2b,f3b), etc]`. Then create a fixed-size `Pool` called `my_pool` and start the separate processes that will use it via `my_pool.map(` (see the sketch after these comments). – martineau Jan 15 '15 at 12:12
  • I've accomplished something like you said, but the list of arguments is bigger than memory (that's why I store these elements in the `fin2` file). Is there a way to pass the lines of a file as arguments? (see EDIT) – Alex Jud Jan 16 '15 at 12:25
  • Since the `fin2` file is only being read, multiple sub-processes should be able to access it at the same time. But in general, yes, it's possible to exchange data using Queues and Pipes, as well as to share data in memory between processes (although the latter is discouraged). Data can also be written to a temporary file and just the file name passed as an argument. – martineau Jan 16 '15 at 17:22
  • Thanks martineau, but I really don't see what you mean I should do... I've posted a new edit (EDIT2); perhaps it's clearer... – Alex Jud Jan 17 '15 at 09:50
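As I understand martineau's suggestion above, the idea looks something like this (a sketch with made-up file names: the input is pre-split into combi_a.txt, combi_b.txt, etc., and each process gets its own input and output files, so no locking is needed):

from multiprocessing import Pool

def process_chunk(args):
    # each worker gets its own slice of combi_nod.txt and its own output file
    combi_file, triangle_file, edge_file = args
    with open(combi_file, 'r') as fin, open(edge_file, 'w') as fout:
        for line in fin:
            pass  # the original loop body goes here, writing to fout

if __name__ == '__main__':
    # hypothetical pre-split input files and per-process output files
    file_sets = [('combi_a.txt', 'triangleset.txt', 'edge_a.adjlist'),
                 ('combi_b.txt', 'triangleset.txt', 'edge_b.adjlist'),
                 ('combi_c.txt', 'triangleset.txt', 'edge_c.adjlist'),
                 ('combi_d.txt', 'triangleset.txt', 'edge_d.adjlist')]
    my_pool = Pool(len(file_sets))
    my_pool.map(process_chunk, file_sets)
    # the edge_*.adjlist pieces can then be merged into edge.adjlist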

0 Answers