
I have a CSV file that contains three columns: Forename, Surname, and Date of Death. I need to parse each line of the CSV, extract the individual parts of the date of death, and use them to build a custom URL that I can then send as a request to a website. The response then needs to be processed to extract data from an HTML table produced by that request, and the extracted data should be stored in either a CSV or a txt file.
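For concreteness, here is a minimal sequential sketch of what I mean, using `requests` and `BeautifulSoup`. The URL template is a placeholder for the real site's query format, and I'm assuming the dates look like dd/mm/yyyy:

import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical URL template -- the real site's query format goes here.
URL_TEMPLATE = "https://example.com/records?day={day}&month={month}&year={year}"

with open("List_new.csv", newline="") as infile, \
        open("results.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)  # note: csv.reader's delimiter must be a single character
    writer = csv.writer(outfile)
    next(reader, None)  # skip the header row, if the file has one
    for forename, surname, date_of_death in reader:
        day, month, year = date_of_death.split("/")  # assuming dd/mm/yyyy
        response = requests.get(URL_TEMPLATE.format(day=day, month=month, year=year))
        response.raise_for_status()
        # Extract the cell text from the first table in the response.
        soup = BeautifulSoup(response.text, "html.parser")
        table = soup.find("table")
        if table is None:
            continue
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
            if cells:
                writer.writerow([forename, surname] + cells)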

How would I make this more efficient via parallelisation, given that there are a decent number of lines in this file that need processing?
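My rough idea is to parallelise the fetch step with a thread pool, since each request spends most of its time waiting on the network rather than the CPU. A sketch under the same assumptions as above (placeholder URL template, dd/mm/yyyy dates):

import csv
import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical URL template, as in the sequential sketch.
URL_TEMPLATE = "https://example.com/records?day={day}&month={month}&year={year}"

def fetch(row):
    forename, surname, date_of_death = row
    day, month, year = date_of_death.split("/")  # assuming dd/mm/yyyy
    response = requests.get(URL_TEMPLATE.format(day=day, month=month, year=year))
    response.raise_for_status()
    return forename, surname, response.text

with open("List_new.csv", newline="") as f:
    rows = list(csv.reader(f))  # skip a header row here if the file has one

# Threads suit this workload because each task mostly waits on the network.
with ThreadPoolExecutor(max_workers=10) as pool:
    for forename, surname, html in pool.map(fetch, rows):
        pass  # parse the HTML table from `html` and write the output row here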

The original version of this program was written in Java and works. I am looking to move it over to Python, as I want to learn the language and I've heard that it is more efficient.

This is the relevant code that I have in Python so far. However, it isn't working at the moment; the error thrown is:

File "<ipython-input-23-17025eccd9eb>", line 4
    print 'line[{}] = {}'.format(i, line)
                        ^
SyntaxError: invalid syntax
import csv

with open("List_new.csv", "r") as f:
    reader = csv.reader(f, delimiter=", ")
    for i, line in enumerate(reader):
        print 'line[{}] = {}'.format(i, line)

Essentially, I wish to go through the CSV file line by line, extract the relevant data, form a custom URL for each line, and send an HTTP request that can then be processed, while sending the requests out asynchronously to speed up the process.
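For the asynchronous part, one option would be the aiohttp package, which can fire off all the requests from a single thread. A minimal sketch under the same assumptions (placeholder URL template, dd/mm/yyyy dates):

import asyncio
import csv
import aiohttp

# Hypothetical URL template, as in the earlier sketches.
URL_TEMPLATE = "https://example.com/records?day={day}&month={month}&year={year}"

async def fetch(session, row):
    forename, surname, date_of_death = row
    day, month, year = date_of_death.split("/")  # assuming dd/mm/yyyy
    async with session.get(URL_TEMPLATE.format(day=day, month=month, year=year)) as response:
        response.raise_for_status()
        return forename, surname, await response.text()

async def main():
    with open("List_new.csv", newline="") as f:
        rows = list(csv.reader(f))  # skip a header row here if the file has one
    # Cap concurrent connections so the target site isn't flooded.
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=10)) as session:
        results = await asyncio.gather(*(fetch(session, row) for row in rows))
    for forename, surname, html in results:
        pass  # parse the HTML table from `html` and write the output row here

asyncio.run(main())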

Any help would be much appreciated!

jmctiernan
  • You're probably running this code with Python 3, where `print` has become a function, not a statement, as it was in Python 2, so you should call it: `print('hello')` – ForceBru Oct 03 '19 at 13:44
  • @ForceBru Thanks! That is working now! – jmctiernan Oct 03 '19 at 13:46
  • When it comes to speeding it up, it depends on what the slow part is. I assume it's hitting the URL. I have solved a similar problem before using `multiprocessing` and `queue`, where I have one worker read each item into a queue and maybe transform it for the URL, then a group of workers to hit the URLs and add the results to another queue, and a final worker writing out the results. The read and write are likely fast; it's just the URL hit and parse that is the limitation. You will probably also get rate-limited by the site you are hitting, which you can't really solve for (this pattern is sketched after this thread). – MyNameIsCaleb Oct 03 '19 at 13:48
  • @MyNameIsCaleb Ah okay, so essentially use one thread/worker for the CSV parsing, multiple for the HTTP requests, and then one to process those outputs to an output file? – jmctiernan Oct 03 '19 at 13:49
  • Yes, but these are all processes, not threads. The bottom of the multiprocessing page in the docs has a functioning example of this concept. – MyNameIsCaleb Oct 03 '19 at 13:51
  • For HTTP requests I believe you'd be using multithreading, not multiprocessing. The latter is limited by the number of processors your machine has and is appropriate for CPU-bound tasks, while HTTP requests are I/O-bound. For the difference between CPU-bound and I/O-bound, [see here](https://stackoverflow.com/questions/868568/what-do-the-terms-cpu-bound-and-i-o-bound-mean). – M3RS Oct 03 '19 at 14:05
  • You could look into aiohttp; it's a package for async requests. – Bendik Knapstad Oct 03 '19 at 14:11
  • @BendikKnapstad I have seen that mentioned while looking around since posting this question so I'll definitely take a look! – jmctiernan Oct 03 '19 at 14:14
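Following MyNameIsCaleb's suggestion above, here is a minimal `multiprocessing` sketch of that reader → fetch workers → writer pipeline. As before, the URL template is a placeholder and the dd/mm/yyyy date format is an assumption:

import csv
import requests
from multiprocessing import Process, Queue

# Hypothetical URL template, as in the earlier sketches.
URL_TEMPLATE = "https://example.com/records?day={day}&month={month}&year={year}"

def read_rows(task_queue, n_fetchers):
    # One worker reads the CSV and feeds the task queue.
    with open("List_new.csv", newline="") as f:
        for row in csv.reader(f):
            task_queue.put(row)
    for _ in range(n_fetchers):
        task_queue.put(None)  # one sentinel per fetch worker

def fetch_rows(task_queue, result_queue):
    # Each fetch worker pulls rows, hits the URL, and queues the raw response.
    while True:
        row = task_queue.get()
        if row is None:
            result_queue.put(None)
            return
        forename, surname, date_of_death = row
        day, month, year = date_of_death.split("/")  # assuming dd/mm/yyyy
        response = requests.get(URL_TEMPLATE.format(day=day, month=month, year=year))
        result_queue.put((forename, surname, response.text))

def write_results(result_queue, n_fetchers):
    # A final worker drains results until every fetcher has sent its sentinel.
    finished = 0
    with open("results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        while finished < n_fetchers:
            item = result_queue.get()
            if item is None:
                finished += 1
                continue
            forename, surname, html = item
            writer.writerow([forename, surname])  # parse the table from `html` here

if __name__ == "__main__":
    n_fetchers = 4
    tasks, results = Queue(), Queue()
    procs = ([Process(target=read_rows, args=(tasks, n_fetchers))]
             + [Process(target=fetch_rows, args=(tasks, results)) for _ in range(n_fetchers)]
             + [Process(target=write_results, args=(results, n_fetchers))])
    for p in procs:
        p.start()
    for p in procs:
        p.join()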
