
Currently, this nested for loop takes almost an hour to run. I am hoping to rewrite it so the work runs in parallel, but I have not found an answer anywhere on how to parallelize something nested like the loop below. Any pointers in the right direction would be greatly appreciated.

    # used to update the Software Names from softwareCollection using the regexCollection
    startTime = time.time()
    cursor = softwareCollection.find(
        {},
        {"Software Name": 1, "Computer Name": 1, "Version": 1, "Publisher": 1, "reged": 1},
        no_cursor_timeout=True)
    for x in cursor:
        for y in regexCollection.find({}, {"regName": 1, "newName": 1}, no_cursor_timeout=True):
            try:
                regExp = re.compile(y["regName"])
            except re.error:  # skip patterns that fail to compile
                print(y["regName"])
                break
            oldName = x["Software Name"]
            newName = y["newName"]
            if regExp.search(oldName):
                x["Software Name"] = newName
                x["reged"] = "true"
                softwareCollection.save(x)
                break
    print((time.time() - startTime) / 60)  # elapsed minutes
    cursor.close()
Loglem
  • can you explain further what this does? – patrick May 02 '17 at 21:24
  • So what it is doing is taking the software name from a MongoDB column and comparing it to a list of regex queries I have saved in a separate Mongo collection. If the name matches the regex, it then renames the field to whatever name is associated with that regex. – Loglem May 02 '17 at 21:45

2 Answers


Depending on the number of iterations over x, you could spawn a thread for each x step, that would iterate over y.

First, define the running function depending on x:

def y_iteration(x):
    for y in ... :
        ...

Then spawn a thread that runs this function at every iteration over x:

for x in ... :
    _thread.start_new_thread(y_iteration, (x,))

This is a very basic example, using the low-level _thread module.

Now you might need to join the threads from the main thread, in which case you will want to use the higher-level threading module instead. You'd probably put your x iteration in a thread and join it:

def x_iteration():
    for x in ... :
        threading.Thread(target=y_iteration, args=(x,)).start()

thread = threading.Thread(target=x_iteration)
thread.start()
thread.join()

Then again, this depends on the number of iterations over x you are planning to do (have a look at How many threads is too many?). If that number is large, you may want to create a pool of, say, one hundred workers, feed them with y_iteration, and whenever every worker is busy, wait until one is free.
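A minimal sketch of that pool idea using the standard `concurrent.futures` module. The MongoDB cursors from the question are replaced here with stand-in data, and the limit of 100 workers is just the figure suggested above; the executor queues the remaining tasks itself, so submitting millions of items never creates millions of threads:

```python
from concurrent.futures import ThreadPoolExecutor

def y_iteration(x):
    # stand-in for the inner regex-matching loop over regexCollection
    return x * 2

xs = range(1000)  # stand-in for softwareCollection.find()

# At most 100 worker threads exist at once; further tasks wait in a queue.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(y_iteration, xs))

print(results[:5])  # -> [0, 2, 4, 6, 8]
```

`pool.map` preserves input order, which makes the results easy to match back to their documents if needed.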

Right leg
  • there are a total of 3.5 million entries, so I am thinking pooling is definitely the way to go? – Loglem May 02 '17 at 21:46
  • @Loglem 3.5 million iterations over `x`? Yeah, that's how I would handle this. – Right leg May 02 '17 at 21:49
  • yes 3.5 million over x and 450 over y. Have you seen any examples of someone doing something similar to this using the pool import? – Loglem May 02 '17 at 21:54
  • @Loglem I cannot think of any, though it does not seem really complicated to me. Since the main point is to limit the number of threads, a basic implementation would be to hold a (ugh) `global` count of threads, and wait whenever this number reaches a limit. The function of each thread (here `y_iteration`) increments this count at the beginning, and decrements it at the end. This is not exactly a pool of threads, because each thread works only once... But this gives you the idea. Just do the same with a list of threads instead of a count. – Right leg May 02 '17 at 23:19

So I was able to get this to run, and it works about twice as fast as the sequential version. My concern is that it still takes 4 hours to finish. Is there a way to make this even more efficient, or should I expect it to take this long?

import re
import time

from joblib import Parallel, delayed  # Parallel and delayed come from joblib

# softwareCollection and regexCollection are assumed to be pymongo
# collections defined at module level, as in the question.

# used to update the Software Names from softwareCollection using the regexCollection
def foo(x):
    for y in regexCollection.find({}, {"regName": 1, "newName": 1}, no_cursor_timeout=True):
        try:
            regExp = re.compile(y["regName"])
        except re.error:  # skip patterns that fail to compile
            print(y["regName"])
            break
        oldName = x["SoftwareName"]
        newName = y["newName"]
        if regExp.search(oldName):
            x["SoftwareName"] = newName
            x["field4"] = "reged"
            softwareCollection.save(x)
            break


if __name__ == '__main__':
    startTime = time.time()
    cursor = softwareCollection.find(no_cursor_timeout=True)
    Parallel(n_jobs=4)(delayed(foo)(x) for x in cursor)

    print((time.time() - startTime) / 60)  # elapsed minutes
    cursor.close()
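Independent of parallelism, one likely bottleneck in the code above is that every call to `foo` re-queries `regexCollection` and re-compiles all of the patterns, once per document. A hedged sketch of compiling each pattern once up front; the hard-coded rule list here is a stand-in for what would be built from `regexCollection.find()` in the real code:

```python
import re

# Stand-in for regexCollection; in the real code this would be built once as
# [(re.compile(y["regName"]), y["newName"]) for y in regexCollection.find(...)]
raw_rules = [
    (r"^Microsoft Office.*", "Microsoft Office"),
    (r"^Adobe Reader.*", "Adobe Reader"),
]
compiled_rules = [(re.compile(pat), new) for pat, new in raw_rules]

def rename(old_name):
    # first matching rule wins, mirroring the break in foo()
    for regexp, new_name in compiled_rules:
        if regexp.search(old_name):
            return new_name
    return old_name

print(rename("Microsoft Office 2013"))  # -> Microsoft Office
print(rename("Some Other App"))         # -> Some Other App
```

With the rules compiled once, each worker only pays for the `search` calls, not for 450 round trips to Mongo and 450 `re.compile` calls per document.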
Loglem