How to order the result while web scraping with multiprocessing?

Question

I am writing a program that scrapes data from multiple URLs using multiprocessing. All the URLs are stored in the bonds_url list. It works and I get output, but the output comes back in random order. I want the scraped data to be in the same order as the URLs in bonds_url. Is there a way to do that?

from requests_html import HTMLSession
import constants

bonds_url  =[]   
from multiprocessing import Pool
    
def f(url):        
    session = HTMLSession()
    response = session.get(url) 
    
    try:    
        data = [i.text.strip() for i in response.html.find(".value") ]
        bonds_values.append(float(data[0])) 
        print(data[0])             
    except:    
        data =  [i.text.strip() for i in response.html.find("div>span[data-reactid='31']")]
        bonds_values.append(float(data[0]))
        print(data[0])
    
if __name__ == '__main__':
    with Pool(len(bonds_url)) as p:
        p.map(f, bonds_url)

Either sort the results or implement a synchronization mechanism. Sorting should be much easier. — Michael Ruth, Sep 07 '21 at 08:58
Can you please explain how to do that? One thing I know is that we need to enumerate that URL list but I am confused about what to pass in 'f' then — Animesh Singh, Sep 07 '21 at 16:32
Welcome to SO. Please take the [tour](https://stackoverflow.com/tour), read [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) and [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). The provided code doesn't run, it raises `ModuleNotFoundError: No module named 'constants'`, and once that's fixed it will raise `NameError: name 'bonds_values' is not defined`. I need to be able to reproduce your result in order to help you. Even just a subset of your input/output would be sufficient. — Michael Ruth, Sep 07 '21 at 17:06
Hold on, it appears that `map` [preserves order](https://stackoverflow.com/questions/41273960/python-3-does-pool-keep-the-original-order-of-data-passed-to-map). This means that `bond_values` will be ordered by `bonds_url` but your `print` statements will likely be out of order. — Michael Ruth, Sep 07 '21 at 17:12

Michael Ruth · Answer 1 · 2021-09-08T16:20:37.803

Solution

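Change the print calls in f to return statements so that multiprocessing.Pool.map gives back the results in order. Pool.map returns a list whose order matches the input iterable, whereas print output from the worker processes appears in whatever order the workers happen to finish.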

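from multiprocessing import Pool

from requests_html import HTMLSession

import constants  # the asker's own module (not shown in the question)

bonds_url = []  # fill in with the URLs to scrape


def f(url):
    session = HTMLSession()
    response = session.get(url)
    try:
        # Keep float() inside the try: find() returns an empty list when
        # nothing matches, so the failure only shows up at data[0].
        data = [i.text.strip() for i in response.html.find(".value")]
        return float(data[0])
    except (IndexError, ValueError):
        # Fall back to the second selector when ".value" gives no usable number.
        data = [i.text.strip() for i in response.html.find("div>span[data-reactid='31']")]
        return float(data[0])


if __name__ == '__main__':
    # Pool needs at least one process, so bonds_url must not be empty.
    with Pool(len(bonds_url)) as p:
        # map returns the results in the same order as bonds_url.
        bond_values = p.map(f, bonds_url)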
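
To see the ordering behaviour without hitting any website, here is a minimal sketch that replaces the scraper with a stand-in. The names fake_scrape, fake_scrape_indexed, restore_order and the example.com URLs are all hypothetical and only for illustration. It also shows the enumerate idea raised in the comments: if you switch to imap_unordered, pass (index, url) pairs to the worker and sort by index afterwards.

```python
import random
import time
from multiprocessing import Pool


def fake_scrape(url):
    # Hypothetical stand-in for f(url): wait a random time so that workers
    # finish out of order, then return a value derived from the URL.
    time.sleep(random.uniform(0, 0.02))
    return len(url)


def fake_scrape_indexed(item):
    # The worker receives an (index, url) pair and passes the index through.
    index, url = item
    return index, fake_scrape(url)


def restore_order(pairs):
    # Sort (index, value) pairs by index and drop the index.
    return [value for _, value in sorted(pairs, key=lambda pair: pair[0])]


def scrape_in_order(urls, processes=4):
    # Pool.map already returns results in input order.
    with Pool(processes) as pool:
        return pool.map(fake_scrape, urls)


def scrape_unordered_then_sort(urls, processes=4):
    # imap_unordered yields results as they complete, so order is restored
    # explicitly from the indices attached by enumerate.
    with Pool(processes) as pool:
        pairs = list(pool.imap_unordered(fake_scrape_indexed, enumerate(urls)))
    return restore_order(pairs)


if __name__ == "__main__":
    urls = ["https://example.com/" + "x" * n for n in range(8)]
    print(scrape_in_order(urls))
    print(scrape_unordered_then_sort(urls))
```

Both calls should print the same list, in the order of urls. With plain map the index bookkeeping is unnecessary; it only matters if you move to imap_unordered to start consuming results as soon as they arrive.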
