
I want to reduce the time it takes for a for loop to complete by using multiprocessing, but I am not sure how to go about it, as I have not seen any clear basic usage pattern for the module that I can apply to this code.

    allLines = fileRead.readlines()
    allLines = [x.strip() for x in allLines]
    for i in range (0,len(allLines)):
        currentWord = allLines[currentLine]
        currentLine += 1
        currentURL = URL+currentWord
        uClient = uReq(currentURL)
        pageHTML = uClient.read()
        uClient.close()
        pageSoup = soup(pageHTML,'html.parser')
        pageHeader = str(pageSoup.h1)
        if 'Sorry!' in pageHeader:
            with open(fileA,'a') as fileAppend:
                fileAppend.write(currentWord + '\n')
            print(currentWord,'available')
        else:
            print(currentWord,'taken')

EDIT: New code but it's still broken...

allLines = fileRead.readlines()
allLines = [x.strip() for x in allLines]
def f(indexes, allLines):
    for i in indexes:
        currentWord = allLines[currentLine]
        currentLine += 1
        currentURL = URL+currentWord
        uClient = uReq(currentURL)
        pageHTML = uClient.read()
        uClient.close()
        pageSoup = soup(pageHTML,'html.parser')
        pageHeader = str(pageSoup.h1)
        if 'Sorry!' in pageHeader:
            with open(fileA,'a') as fileAppend:
                fileAppend.write(currentWord + '\n')
            print(currentWord,'available')
        else:
            print(currentWord,'taken')
for i in range(threads):
    indexes = range(i*len(allLines), i*len(allLines)+threads, 1)
    Thread(target=f, args=(indexes, allLines)).start()
leon
  • What is `currentLine` ? – politinsa Jun 24 '19 at 09:57
  • @politinsa I have a file full of words on individual lines; allLines is an array of those words, and currentLine is the name I gave to reference each item or "line" as I increment it to move to the next one. – leon Jun 24 '19 at 10:36
  • Are you using python 2 or 3? – PyPingu Jun 24 '19 at 11:11
  • @PyPingu I am using python 3.7.3 – leon Jun 24 '19 at 12:11
  • Did you try my answer? If it worked I can adjust it to use `concurrent.futures.ThreadPoolExecutor` – PyPingu Jun 24 '19 at 12:17
  • @PyPingu I tried your answer and it worked beautifully! [EDIT: removed some nonsense about not printing] – leon Jun 24 '19 at 12:30
  • Hmm. Does it throw any error? It should be writing into whatever is `fileA`. Perhaps print out free_words to check it’s actually a list of the words – PyPingu Jun 24 '19 at 12:32
  • @PyPingu Nevermind that, I had left out a whole part of your solution. It works now. – leon Jun 24 '19 at 12:34
  • What are the timings like? How many URLs are you checking? – PyPingu Jun 24 '19 at 12:40
  • I am checking 58000 URLs and it takes over 4 seconds between each check. As far as I can tell all that is happening is the different threads are taking turns. – leon Jun 24 '19 at 12:41
  • So if you run it in a simple for-loop versus using the thread pool the total time is the same? – PyPingu Jun 24 '19 at 13:02
  • What are the `uReq` and `soup` variables? If they're `urllib` and "Beautiful Soup" respectively, then you probably don't want to use threads, as I don't think either [releases the GIL](https://stackoverflow.com/q/1294382/1358308). BS4 is also **very** slow at parsing HTML, so you might want to try another library. That said, you could try using processes with multiprocessing, as that might give you more concurrency. – Sam Mason Jun 24 '19 at 13:09
    Ok, if what @SamMason says about the GIL is true that would explain things. To use multiple processes as he says - replace `multiprocessing.dummy` with just `multiprocessing`. – PyPingu Jun 24 '19 at 13:13
  • @PyPingu I had a suspicion this was the case. Used multiprocessing instead and it appears that it is stuck and not returning any output. My CPU usage increases significantly while running. – leon Jun 24 '19 at 13:56
  • Yeah, well, if you left the pool with 10 workers and you only have 1 CPU, it has possibly locked up your PC. You can try reducing your number of workers to 4 or something. The other possibility is to change the HTML parser you use, as Sam suggested. You could try the [`lxml`](https://lxml.de/) module, which is supposed to be fast. You could also just use threading for doing the requests. I'd use the [`requests`](https://2.python-requests.org/en/master/) module and just return a string of the page content and do all the parsing afterwards. It depends on where the bottleneck really is. – PyPingu Jun 24 '19 at 14:02
  • I have 4 cores and I already reduced the number to 4. When I check the running threads it tells me I have 5 threads running, so I reduced the number to 3 but it still doesn't work as intended. I will look at the modules and methods you suggested. Thank you. – leon Jun 24 '19 at 14:05
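
For reference, the `concurrent.futures.ThreadPoolExecutor` adaptation PyPingu offers above might look roughly like the following untested sketch; `check_word` is just an illustrative name, and `URL`, `uReq`, `soup` and `allLines` are assumed to exist as in the question.

from concurrent.futures import ThreadPoolExecutor

def check_word(word):
    #fetch URL+word and report whether the page header contains 'Sorry!'
    uClient = uReq(URL+word)
    pageHTML = uClient.read()
    uClient.close()
    pageSoup = soup(pageHTML,'html.parser')
    return word if 'Sorry!' in str(pageSoup.h1) else None

with ThreadPoolExecutor(max_workers=10) as executor:
    #map submits every word up front and yields results in input order
    free_words = [w for w in executor.map(check_word, allLines) if w is not None]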

2 Answers

  • Put the code in a function
  • Split indexes
  • Start threads
from threading import Thread

THREADS = 10

allLines = fileRead.readlines()
allLines = [x.strip() for x in allLines]

def f(indexes, allLines):
    #This entire for loop needs to be parallelized
    for i in indexes:
        currentWord = allLines[currentLine]
        currentLine += 1
        currentURL = URL+currentWord
        uClient = uReq(currentURL)
        pageHTML = uClient.read()
        uClient.close()
        pageSoup = soup(pageHTML,'html.parser')
        pageHeader = str(pageSoup.h1)
        if 'Sorry!' in pageHeader:
            with open(fileA,'a') as fileAppend:
                fileAppend.write(currentWord + '\n')
            print(currentWord,'available')
        else:
            print(currentWord,'taken')

for i in range(THREADS):
  indexes = range(i*len(allLines), i*len(allLines)+THREADS, 1)
  Thread(target=f, args=(indexes, allLines)).start()
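
Note that the index ranges as written only cover the first THREADS items (and step past the end of the list for every i > 0), and `currentLine` is never defined inside `f` before it is incremented. A rough, untested sketch of splitting the list into even chunks instead, indexing with `allLines[i]` inside the loop, could be:

#Split allLines into THREADS contiguous chunks (ceiling division so nothing is dropped)
chunk = -(-len(allLines) // THREADS)
for i in range(THREADS):
    indexes = range(i*chunk, min((i+1)*chunk, len(allLines)))
    Thread(target=f, args=(indexes, allLines)).start()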
politinsa
  • You have `args=(indexes, allLines)`, but `f(allLines, indexes)`. One of them should be swapped. – rtoijala Jun 24 '19 at 10:22
  • Yeah, I noticed that. Thank you @rtoijala it's fixed now. – leon Jun 24 '19 at 10:35
  • There is a slight problem with this code. It keeps printing/writing only the first word from the list. I have put the new/broken code in an edit to my question above. – leon Jun 24 '19 at 10:46

It's difficult to know exactly where the problem is occurring without seeing the real input and output.

You could try this approach using the `multiprocessing.dummy` module, which is just a wrapper around the `threading` module.

import multiprocessing.dummy

#URL, uReq, soup, fileR and fileA are assumed to be defined elsewhere, as in the question
def parse_url(word):
    currentURL = URL+word
    uClient = uReq(currentURL)
    pageHTML = uClient.read()
    uClient.close()
    pageSoup = soup(pageHTML,'html.parser')
    pageHeader = str(pageSoup.h1)
    if 'Sorry!' in pageHeader:
        print(currentURL,'is available.')
        return word
    else:
        print(currentURL,'is taken.')
        return None

with open(fileR,'r') as fileRead:
    #This is just for printing two newlines? Could replace with a single print('\n')
    print('')
    print('')
    print(fileRead.name,fileRead.mode)
    with open(fileA,'w') as fileWrite:
        fileWrite.write('')
        print('')
        print('')
        print(fileWrite.name,'emptied.')
    allLines = fileRead.readlines()
    allLines = [x.strip() for x in allLines]

#Make a pool of 10 worker threads
with multiprocessing.dummy.Pool(10) as pool:
    result = pool.map_async(parse_url, allLines)
    #wait for all the URLs to be checked
    word_list = result.get()
    free_words = [x for x in word_list if x is not None]

with open(fileA,'w') as fileAppend:
    fileAppend.write('\n'.join(free_words))
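
If the GIL turns out to be the bottleneck, as discussed in the comments under the question, the same pattern can be tried with real processes by swapping `multiprocessing.dummy` for `multiprocessing`. A minimal, untested sketch, keeping `parse_url`, `allLines` and `fileA` exactly as above; with real processes the worker function must be defined at module level so it can be pickled, and the pool should be started under the `__main__` guard:

import multiprocessing

if __name__ == '__main__':
    #4 workers, roughly one per core
    with multiprocessing.Pool(4) as pool:
        word_list = pool.map(parse_url, allLines)   #blocks until every URL is checked
    free_words = [x for x in word_list if x is not None]

    with open(fileA,'w') as fileAppend:
        fileAppend.write('\n'.join(free_words))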
PyPingu