
I have a CSV file that contains three columns: Forename, Surname, and Date of Death. I need to parse each line of the CSV, extract the individual parts of the date of death, and use them to build a custom URL that I can then send as a request to a website. The response then needs to be processed to extract data from an HTML table produced by that request, and the extracted data should be stored in either a CSV or a txt file.
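For concreteness, here is a minimal sequential sketch of what I mean, using `requests` and `BeautifulSoup`. The URL template is a placeholder for the real site's query format, and I'm assuming the dates look like dd/mm/yyyy:

import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical URL template -- the real site's query format goes here.
URL_TEMPLATE = "https://example.com/records?day={day}&month={month}&year={year}"

with open("List_new.csv", newline="") as infile, \
        open("results.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)  # note: csv.reader's delimiter must be a single character
    writer = csv.writer(outfile)
    next(reader, None)  # skip the header row, if the file has one
    for forename, surname, date_of_death in reader:
        day, month, year = date_of_death.split("/")  # assuming dd/mm/yyyy
        response = requests.get(URL_TEMPLATE.format(day=day, month=month, year=year))
        response.raise_for_status()
        # Extract the cell text from the first table in the response.
        soup = BeautifulSoup(response.text, "html.parser")
        table = soup.find("table")
        if table is None:
            continue
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
            if cells:
                writer.writerow([forename, surname] + cells)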

How would I make this more efficient via parallelisation, given that there are a decent number of lines in this file that need processing?
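My rough idea is to parallelise the fetch step with a thread pool, since each request spends most of its time waiting on the network rather than the CPU. A sketch under the same assumptions as above (placeholder URL template, dd/mm/yyyy dates):

import csv
import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical URL template, as in the sequential sketch.
URL_TEMPLATE = "https://example.com/records?day={day}&month={month}&year={year}"

def fetch(row):
    forename, surname, date_of_death = row
    day, month, year = date_of_death.split("/")  # assuming dd/mm/yyyy
    response = requests.get(URL_TEMPLATE.format(day=day, month=month, year=year))
    response.raise_for_status()
    return forename, surname, response.text

with open("List_new.csv", newline="") as f:
    rows = list(csv.reader(f))  # skip a header row here if the file has one

# Threads suit this workload because each task mostly waits on the network.
with ThreadPoolExecutor(max_workers=10) as pool:
    for forename, surname, html in pool.map(fetch, rows):
        pass  # parse the HTML table from `html` and write the output row here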

The original version of this program was written in Java and works. I am looking to move it over to Python, as I want to learn the language and I've heard that it is more efficient.

This is the relevant code that I have in Python so far. However, it isn't working at the moment; the error thrown is:

File "<ipython-input-23-17025eccd9eb>", line 4
    print 'line[{}] = {}'.format(i, line)
                        ^
SyntaxError: invalid syntax
import csv

with open("List_new.csv", "r") as f:
    reader = csv.reader(f, delimiter=", ")
    for i, line in enumerate(reader):
        print 'line[{}] = {}'.format(i, line)

Essentially, I wish to go through the CSV file line by line, extract the relevant data, form a custom URL for each line, and send an HTTP request that can then be processed, while sending the requests out asynchronously to speed up the process.
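For the asynchronous part, one option would be the aiohttp package, which can fire off all the requests from a single thread. A minimal sketch under the same assumptions (placeholder URL template, dd/mm/yyyy dates):

import asyncio
import csv
import aiohttp

# Hypothetical URL template, as in the earlier sketches.
URL_TEMPLATE = "https://example.com/records?day={day}&month={month}&year={year}"

async def fetch(session, row):
    forename, surname, date_of_death = row
    day, month, year = date_of_death.split("/")  # assuming dd/mm/yyyy
    async with session.get(URL_TEMPLATE.format(day=day, month=month, year=year)) as response:
        response.raise_for_status()
        return forename, surname, await response.text()

async def main():
    with open("List_new.csv", newline="") as f:
        rows = list(csv.reader(f))  # skip a header row here if the file has one
    # Cap concurrent connections so the target site isn't flooded.
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=10)) as session:
        results = await asyncio.gather(*(fetch(session, row) for row in rows))
    for forename, surname, html in results:
        pass  # parse the HTML table from `html` and write the output row here

asyncio.run(main())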

Any help would be much appreciated!

jmctiernan
  • You're probably running this code with Python 3, where `print` has become a function, not a statement, as it was in Python 2, so you should call it: `print('hello')` – ForceBru Oct 03 '19 at 13:44
  • @ForceBru Thanks! That is working now! – jmctiernan Oct 03 '19 at 13:46
  • When it comes to speeding it up, it depends on what the slow part is. I assume it's hitting the URL. I have solved a similar problem before using `multiprocessing` and `queue`, where I have one worker read each item into a queue and maybe transform it for the URL, then a group of workers to hit the URLs and add the results to another queue, and a final worker writing out the results. The read and write are likely fast; it's just the URL hit and parse that is the limitation. You will probably also get rate-limited by the site you are hitting, which you can't really solve for (this pattern is sketched after this thread). – MyNameIsCaleb Oct 03 '19 at 13:48
  • @MyNameIsCaleb Ah okay, so essentially use one thread/worker for the CSV parsing, multiple for the HTTP requests, and then one to process those outputs to an output file? – jmctiernan Oct 03 '19 at 13:49
  • Yes, but these are all processes, not threads. The bottom of the multiprocessing page in the docs has a functioning example of this concept. – MyNameIsCaleb Oct 03 '19 at 13:51
  • For HTTP requests I believe you'd be using multithreading, not multiprocessing. The latter is limited by the number of processors your machine has and is appropriate for CPU-bound tasks, while HTTP requests are I/O-bound. For the difference between CPU-bound and I/O-bound, [see here](https://stackoverflow.com/questions/868568/what-do-the-terms-cpu-bound-and-i-o-bound-mean). – M3RS Oct 03 '19 at 14:05
  • You could look into aiohttp; it's a package for async requests. – Bendik Knapstad Oct 03 '19 at 14:11
  • @BendikKnapstad I have seen that mentioned while looking around since posting this question so I'll definitely take a look! – jmctiernan Oct 03 '19 at 14:14
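Following MyNameIsCaleb's suggestion above, here is a minimal `multiprocessing` sketch of that reader → fetch workers → writer pipeline. As before, the URL template is a placeholder and the dd/mm/yyyy date format is an assumption:

import csv
import requests
from multiprocessing import Process, Queue

# Hypothetical URL template, as in the earlier sketches.
URL_TEMPLATE = "https://example.com/records?day={day}&month={month}&year={year}"

def read_rows(task_queue, n_fetchers):
    # One worker reads the CSV and feeds the task queue.
    with open("List_new.csv", newline="") as f:
        for row in csv.reader(f):
            task_queue.put(row)
    for _ in range(n_fetchers):
        task_queue.put(None)  # one sentinel per fetch worker

def fetch_rows(task_queue, result_queue):
    # Each fetch worker pulls rows, hits the URL, and queues the raw response.
    while True:
        row = task_queue.get()
        if row is None:
            result_queue.put(None)
            return
        forename, surname, date_of_death = row
        day, month, year = date_of_death.split("/")  # assuming dd/mm/yyyy
        response = requests.get(URL_TEMPLATE.format(day=day, month=month, year=year))
        result_queue.put((forename, surname, response.text))

def write_results(result_queue, n_fetchers):
    # A final worker drains results until every fetcher has sent its sentinel.
    finished = 0
    with open("results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        while finished < n_fetchers:
            item = result_queue.get()
            if item is None:
                finished += 1
                continue
            forename, surname, html = item
            writer.writerow([forename, surname])  # parse the table from `html` here

if __name__ == "__main__":
    n_fetchers = 4
    tasks, results = Queue(), Queue()
    procs = ([Process(target=read_rows, args=(tasks, n_fetchers))]
             + [Process(target=fetch_rows, args=(tasks, results)) for _ in range(n_fetchers)]
             + [Process(target=write_results, args=(results, n_fetchers))])
    for p in procs:
        p.start()
    for p in procs:
        p.join()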
