
I want to reduce the time it takes for a for loop to complete by using multiprocessing, but I am not sure how to go about it, as I have not seen any clear basic usage pattern for the module that I can apply to this code.

    allLines = fileRead.readlines()
    allLines = [x.strip() for x in allLines]
    for i in range (0,len(allLines)):
        currentWord = allLines[currentLine]
        currentLine += 1
        currentURL = URL+currentWord
        uClient = uReq(currentURL)
        pageHTML = uClient.read()
        uClient.close()
        pageSoup = soup(pageHTML,'html.parser')
        pageHeader = str(pageSoup.h1)
        if 'Sorry!' in pageHeader:
            with open(fileA,'a') as fileAppend:
                fileAppend.write(currentWord + '\n')
            print(currentWord,'available')
        else:
            print(currentWord,'taken')

EDIT: New code but it's still broken...

allLines = fileRead.readlines()
allLines = [x.strip() for x in allLines]
def f(indexes, allLines):
    for i in indexes:
        currentWord = allLines[currentLine]
        currentLine += 1
        currentURL = URL+currentWord
        uClient = uReq(currentURL)
        pageHTML = uClient.read()
        uClient.close()
        pageSoup = soup(pageHTML,'html.parser')
        pageHeader = str(pageSoup.h1)
        if 'Sorry!' in pageHeader:
            with open(fileA,'a') as fileAppend:
                fileAppend.write(currentWord + '\n')
            print(currentWord,'available')
        else:
            print(currentWord,'taken')
for i in range(threads):
    indexes = range(i*len(allLines), i*len(allLines)+threads, 1)
    Thread(target=f, args=(indexes, allLines)).start()
leon
  • What is `currentLine` ? – politinsa Jun 24 '19 at 09:57
  • @politinsa I have a file full of words on individual lines; allLines is an array of those words, and currentLine is the name I gave to reference each item or "line" as I increment it to move to the next one. – leon Jun 24 '19 at 10:36
  • Are you using python 2 or 3? – PyPingu Jun 24 '19 at 11:11
  • @PyPingu I am using python 3.7.3 – leon Jun 24 '19 at 12:11
  • Did you try my answer? If it worked I can adjust it to use `concurrent.futures.ThreadPoolExecutor` – PyPingu Jun 24 '19 at 12:17
  • @PyPingu I tried your answer and it worked beautifully! [EDIT: removed some nonsense about not printing] – leon Jun 24 '19 at 12:30
  • Hmm. Does it throw any error? It should be writing into whatever is `fileA`. Perhaps print out free_words to check it’s actually a list of the words – PyPingu Jun 24 '19 at 12:32
  • @PyPingu Nevermind that, I had left out a whole part of your solution. It works now. – leon Jun 24 '19 at 12:34
  • What are the timings like? How many URLs are you checking? – PyPingu Jun 24 '19 at 12:40
  • I am checking 58000 URLs and it takes over 4 seconds between each check. As far as I can tell all that is happening is the different threads are taking turns. – leon Jun 24 '19 at 12:41
  • So if you run it in a simple for-loop versus using the thread pool the total time is the same? – PyPingu Jun 24 '19 at 13:02
  • What are the `uReq` and `soup` variables? If they're `urllib` and "Beautiful Soup" respectively, then you probably don't want to use threads, as I don't think either [releases the GIL](https://stackoverflow.com/q/1294382/1358308). BS4 is also **very** slow at parsing HTML, so you might want to try another library. That said, you could try using processes with multiprocessing, as that might give you more concurrency. – Sam Mason Jun 24 '19 at 13:09
    Ok, if what @SamMason says about the GIL is true that would explain things. To use multiple processes as he says - replace `multiprocessing.dummy` with just `multiprocessing`. – PyPingu Jun 24 '19 at 13:13
  • @PyPingu I had a suspicion this was the case. Used multiprocessing instead and it appears that it is stuck and not returning any output. My CPU usage increases significantly while running. – leon Jun 24 '19 at 13:56
  • Yeah, well, if you left the pool with 10 workers and you only have 1 CPU, it has possibly locked up your PC. You can try reducing your number of workers to 4 or something. The other possibility is to change the HTML parser you use, as Sam suggested. You could try the [`lxml`](https://lxml.de/) module, which is supposed to be fast. You could also just use threading for doing the requests. I'd use the [`requests`](https://2.python-requests.org/en/master/) module and just return a string of the page content and do all the parsing afterwards. It depends on where the bottleneck really is. – PyPingu Jun 24 '19 at 14:02
  • I have 4 cores and I already reduced the number to 4. When I check the running threads it tells me I have 5 threads running, so I reduced the number to 3 but it still doesn't work as intended. I will look at the modules and methods you suggested. Thank you. – leon Jun 24 '19 at 14:05
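
For reference, the `concurrent.futures.ThreadPoolExecutor` adaptation PyPingu offers above might look roughly like the following untested sketch; `check_word` is just an illustrative name, and `URL`, `uReq`, `soup` and `allLines` are assumed to exist as in the question.

from concurrent.futures import ThreadPoolExecutor

def check_word(word):
    #fetch URL+word and report whether the page header contains 'Sorry!'
    uClient = uReq(URL+word)
    pageHTML = uClient.read()
    uClient.close()
    pageSoup = soup(pageHTML,'html.parser')
    return word if 'Sorry!' in str(pageSoup.h1) else None

with ThreadPoolExecutor(max_workers=10) as executor:
    #map submits every word up front and yields results in input order
    free_words = [w for w in executor.map(check_word, allLines) if w is not None]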

2 Answers

  • Put the code in a function
  • Split indexes
  • Start threads
from threading import Thread

THREADS = 10

allLines = fileRead.readlines()
allLines = [x.strip() for x in allLines]

def f(indexes, allLines):
    #This entire for loop needs to be parallelized
    for i in indexes:
        currentWord = allLines[currentLine]
        currentLine += 1
        currentURL = URL+currentWord
        uClient = uReq(currentURL)
        pageHTML = uClient.read()
        uClient.close()
        pageSoup = soup(pageHTML,'html.parser')
        pageHeader = str(pageSoup.h1)
        if 'Sorry!' in pageHeader:
            with open(fileA,'a') as fileAppend:
                fileAppend.write(currentWord + '\n')
            print(currentWord,'available')
        else:
            print(currentWord,'taken')

for i in range(THREADS):
  indexes = range(i*len(allLines), i*len(allLines)+THREADS, 1)
  Thread(target=f, args=(indexes, allLines)).start()
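
Note that the index ranges as written only cover the first THREADS items (and step past the end of the list for every i > 0), and `currentLine` is never defined inside `f` before it is incremented. A rough, untested sketch of splitting the list into even chunks instead, indexing with `allLines[i]` inside the loop, could be:

#Split allLines into THREADS contiguous chunks (ceiling division so nothing is dropped)
chunk = -(-len(allLines) // THREADS)
for i in range(THREADS):
    indexes = range(i*chunk, min((i+1)*chunk, len(allLines)))
    Thread(target=f, args=(indexes, allLines)).start()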
politinsa
  • You have `args=(indexes, allLines)`, but `f(allLines, indexes)`. One of them should be swapped. – rtoijala Jun 24 '19 at 10:22
  • Yeah, I noticed that. Thank you @rtoijala it's fixed now. – leon Jun 24 '19 at 10:35
  • There is a slight problem with this code. It keeps printing/writing only the first word from the list. I have put the new/broken code in an edit to my question above. – leon Jun 24 '19 at 10:46

It's difficult to know exactly where the problem is occurring without seeing the real input and output.

You could try this approach using the `multiprocessing.dummy` module, which is just a wrapper around the `threading` module.

import multiprocessing.dummy

#URL, uReq, soup, fileR and fileA are assumed to be defined elsewhere, as in the question
def parse_url(word):
    currentURL = URL+word
    uClient = uReq(currentURL)
    pageHTML = uClient.read()
    uClient.close()
    pageSoup = soup(pageHTML,'html.parser')
    pageHeader = str(pageSoup.h1)
    if 'Sorry!' in pageHeader:
        print(currentURL,'is available.')
        return word
    else:
        print(currentURL,'is taken.')
        return None

with open(fileR,'r') as fileRead:
    #This is just for printing two newlines? Could replace with a single print('\n')
    print('')
    print('')
    print(fileRead.name,fileRead.mode)
    with open(fileA,'w') as fileWrite:
        fileWrite.write('')
        print('')
        print('')
        print(fileWrite.name,'emptied.')
    allLines = fileRead.readlines()
    allLines = [x.strip() for x in allLines]

#Make a pool of 10 worker threads
with multiprocessing.dummy.Pool(10) as pool:
    result = pool.map_async(parse_url, allLines)
    #wait for all the URLs to be checked
    word_list = result.get()
    free_words = [x for x in word_list if x is not None]

with open(fileA,'w') as fileAppend:
    fileAppend.write('\n'.join(free_words))
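
If the GIL turns out to be the bottleneck, as discussed in the comments under the question, the same pattern can be tried with real processes by swapping `multiprocessing.dummy` for `multiprocessing`. A minimal, untested sketch, keeping `parse_url`, `allLines` and `fileA` exactly as above; with real processes the worker function must be defined at module level so it can be pickled, and the pool should be started under the `__main__` guard:

import multiprocessing

if __name__ == '__main__':
    #4 workers, roughly one per core
    with multiprocessing.Pool(4) as pool:
        word_list = pool.map(parse_url, allLines)   #blocks until every URL is checked
    free_words = [x for x in word_list if x is not None]

    with open(fileA,'w') as fileAppend:
        fileAppend.write('\n'.join(free_words))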
PyPingu