
I have a function that processes one url at a time:

import http.client
import urllib.error
import urllib.request

def sanity(url):
    try:
        if 'media' in url[:10]:
            url = "http://dummy.s3.amazonaws.com" + url
        req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
        ret = urllib.request.urlopen(req)
        allurls.append(url)
        return 1
    except (urllib.error.HTTPError, urllib.error.URLError, http.client.HTTPException, ValueError) as e:
        print(e, url)
        allurls.append(url)
        errors.append(url)
        return 0

In the main function, I have a list of URLs that need to be processed by the above function. Here is what I tried, but it doesn't work:

from multiprocessing import Process

start = 0
allurls = []
errors = []
# arr = [0, 100, 200...]
for i in arr:
    p = Process(target=sanity, args=(urls[start:i],))
    p.start()
    p.join()

The above code is supposed to process the URLs in batches of 100, but it doesn't work. I know it isn't working because I write the lists allurls and errors to two different files at the end, and those files are empty when they should not be. I don't understand this behavior.

  • The biggest issue with your code is the `p=Process`, `p.start()` and `p.join()` inside your for loop. You create a process, start it, and then immediately join (wait for) it within the loop, so each process runs on its own in serial, not in parallel as you hoped (a start-all/join-all sketch follows these comments). See my answer below on first splitting the data into chunks and then processing them (the `map` function works better here - it gives you less rope to hang yourself with). – Jurgen Strydom Mar 25 '19 at 09:06
  • My [answer](https://stackoverflow.com/a/39055993/355230) to a related question may help. – martineau Mar 25 '19 at 09:47
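
For illustration, here is a minimal sketch of the start-all/join-all pattern that the first comment describes. The sanity_batch() worker, the dummy URLs and the arr values are assumptions for illustration (the original sanity() handles a single URL, so a batch-aware wrapper is assumed). Note that even with this change the global allurls and errors lists would still come back empty in the parent, because each child process appends to its own copy of them:

from multiprocessing import Process

def sanity_batch(url_batch):
    # Hypothetical batch-aware wrapper; the real per-URL check would go here.
    for url in url_batch:
        pass

if __name__ == '__main__':
    urls = ['http://example.com/%d' % k for k in range(300)]  # dummy data
    arr = [100, 200, 300]  # batch end indices, as in the question
    processes = []
    start = 0
    for i in arr:
        p = Process(target=sanity_batch, args=(urls[start:i],))
        p.start()              # launch the worker without waiting for it
        processes.append(p)
        start = i              # advance the window to the next batch
    for p in processes:
        p.join()               # wait for all workers only after all have started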

1 Answer


If I understand you correctly, you want to process chunks of a list at a time, and process those chunks in parallel? Secondly, you want to store the answers in a global variable. The problem is that processes are not threads, so sharing memory between them is much more involved.

So the alternative is to return the answer; the code below helps you do just that. First you need to convert your list into a list of lists, each inner list containing the data you want to process in that chunk. You can then pass that list of lists to a function that processes each of them. The output for each chunk is a list of answers and a list of errors (I'd recommend converting this to a dict to keep track of which item threw an error). After the processes return, you can untangle the list of lists to create your list of answers and your list of errors.

Here is the code that would achieve the above:

from multiprocessing import Pool

def f(x):
    try:
        return [x*x, None]  # [result, no error] on success
    except Exception as e:
        return [None, e]    # [no result, error] on failure

def chunk_f(x):
    output = []
    errors = []
    for xi in x:
        ans, err = f(xi)
        if ans is not None:   # compare against None so a result of 0 is kept
            output.append(ans)
        if err is not None:
            errors.append(err)
    return [output, errors]

n = 10  # chunk size
data = list(range(95))  # test data
data.extend(['a', 'b'])  # two items that will raise TypeError inside f()

# split data into a list of lists of length n (a trailing empty chunk is harmless)
l = [data[k*n:(k+1)*n] for k in range(int(len(data)/n+1))]

p = Pool(8)
d = p.map(chunk_f, l)

new_data = []
all_errors = []
for da, de in d:
    new_data.extend(da)
    all_errors.extend(de)
print(new_data)
print(all_errors)
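
To connect this back to the URL checker in the question, here is a hedged sketch of what f() and chunk_f() might look like for that job; check_url(), check_chunk(), the timeout value and the dummy URLs are assumptions for illustration, not part of the original code:

import urllib.error
import urllib.request
from multiprocessing import Pool

def check_url(url):
    # Per-URL version of f(): return [url, None] on success, [None, (url, error)] on failure.
    try:
        req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
        urllib.request.urlopen(req, timeout=10)
        return [url, None]
    except (urllib.error.URLError, ValueError) as e:  # HTTPError is a subclass of URLError
        return [None, (url, str(e))]

def check_chunk(chunk):
    # Chunk-level version of chunk_f(): collect good URLs and errors separately.
    good, bad = [], []
    for url in chunk:
        ok, err = check_url(url)
        if ok is not None:
            good.append(ok)
        if err is not None:
            bad.append(err)
    return [good, bad]

if __name__ == '__main__':
    urls = ['http://example.com', 'http://invalid.invalid']  # dummy test data
    n = 100                                                  # chunk size
    chunks = [urls[k:k + n] for k in range(0, len(urls), n)]
    with Pool(8) as pool:
        results = pool.map(check_chunk, chunks)
    allurls = [u for good, _ in results for u in good]
    errors = [e for _, bad in results for e in bad]
    print(allurls, errors)

Each chunk returns its own pair of lists, so the parent never needs shared globals; allurls and errors are rebuilt from the return values of Pool.map.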

You can also look at this Stack Overflow answer on different methods of chunking your data.
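
As one more example (a sketch, not taken from the answer referenced above), an itertools-based chunker works for any iterable, not only for lists that support slicing:

from itertools import islice

def chunked(iterable, n):
    # Yield successive lists of up to n items from any iterable.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            break
        yield chunk

print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]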

  • Edited so that both answers and errors are passed back and then untangled at the end. – Jurgen Strydom Mar 25 '19 at 09:13
  • I appreciate your answer, but I was looking to use global lists instead of passing and returning them. I find that to be the other problem in the code. – Eswar Mar 25 '19 at 09:16
  • Globals are rarely a good idea, because you can have multiple threads updating the same object at the same time. This sounds like a disaster waiting to happen. – Jurgen Strydom Mar 25 '19 at 09:17
  • See [this SO thread](https://stackoverflow.com/questions/18778187/multiprocessing-pool-with-a-global-variable), you can do what you want to do but it is much more involved. Processes are not threads, so they do not share memory out of the box. Globals are copied and can't be written back to, it seems. – Jurgen Strydom Mar 25 '19 at 09:27
  • Yeah. I initially tried manager lists (Manager().list()) for sharing lists between processes, but I find your code to be better; the manager lists didn't work as expected (see the Manager sketch after these comments). – Eswar Mar 25 '19 at 09:32
  • I have used your idea and run the code with chunks of 1000 URLs on a list of 15k URLs. It takes at least 15 minutes, which should not be the case. The code runs successfully, but I am wondering whether it is actually running in parallel or whether the overhead is causing this delay. – Eswar Mar 25 '19 at 10:38
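
For completeness, here is a minimal sketch of the Manager-based route discussed in the comments above: Manager().list() gives proxy lists that worker processes can append to, at the cost of an extra round trip to the manager process for every append, which is one reason returning results from Pool.map is usually preferred. The worker and test data below are assumptions for illustration:

from multiprocessing import Manager, Pool

def worker(args):
    # Append to the shared proxy lists instead of returning results.
    x, shared_ok, shared_err = args
    try:
        shared_ok.append(x * x)
    except Exception as e:
        shared_err.append((x, str(e)))

if __name__ == '__main__':
    with Manager() as manager:
        ok = manager.list()    # proxy list, visible to all workers
        err = manager.list()
        data = list(range(10)) + ['a']  # 'a' will raise TypeError in the worker
        with Pool(4) as pool:
            pool.map(worker, [(x, ok, err) for x in data])
        print(list(ok), list(err))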