I'm trying to speed up a script that scrapes an XML document obtained by making a request to an API with urllib. I have to make ~2.3 million requests, so it takes ~8 hours without multiprocessing.

Without multiprocessing:
from urllib import request as rq
from lxml import etree

def download_data(id):
    data = []
    # url (the API's base endpoint) is defined elsewhere in the script
    xml = etree.iterparse(rq.urlretrieve(url + id + ".xml")[0], events=('start', 'end'))
    for event, id_data in xml:
        if event == "start":
            try:
                data.append(id_data.get('value'))
            except:
                pass
    return data

with open("/path/to/file", "rt") as ids_file:
    ids = ids_file.read().splitlines()

data_dict = {id: download_data(id) for id in ids}
I've tried the following code:
from urllib import request as rq
from lxml import etree
from multiprocessing import Pool, cpu_count

def download_data(id):
    data = []
    xml = etree.iterparse(rq.urlretrieve(url + id + ".xml")[0], events=('start', 'end'))
    for event, id_data in xml:
        if event == "start":
            try:
                data.append(id_data.get('value'))
            except:
                pass
    return (id, data)

with open("/path/to/file", "rt") as ids_file:
    ids = ids_file.read().splitlines()

with Pool(processes=cpu_count()*2) as pool:
    dt = pool.map(download_data, ids)

data_dict = dict(dt)
I get the following error:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
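From the message, I gather that the code creating the Pool has to sit under the main-module guard, so that spawned child processes can re-import the script without re-running it. A minimal sketch of what I think it's asking for (download_data unchanged; url and the file path are placeholders for my real values):

from urllib import request as rq
from lxml import etree
from multiprocessing import Pool, cpu_count

url = "https://example.com/api/"  # placeholder for my real endpoint

def download_data(id):
    data = []
    xml = etree.iterparse(rq.urlretrieve(url + id + ".xml")[0], events=('start', 'end'))
    for event, id_data in xml:
        if event == "start":
            try:
                data.append(id_data.get('value'))
            except:
                pass
    return (id, data)

if __name__ == '__main__':
    # Only the parent process runs this block; children spawned by the
    # Pool re-import the module and skip it, which should avoid the
    # bootstrapping error.
    with open("/path/to/file", "rt") as ids_file:
        ids = ids_file.read().splitlines()

    with Pool(processes=cpu_count() * 2) as pool:
        dt = pool.map(download_data, ids)

    data_dict = dict(dt)

I'm not sure this is the right fix, though.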
Any suggestions?
Thank you in advance!