
I am trying to process a relatively large (about 100k lines) CSV file in Python. This is what my code looks like:

#!/usr/bin/env python

import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os

csvFileName = sys.argv[1]


with open(csvFileName, 'r') as inputFile:
    parsedFile = csv.DictReader(inputFile, delimiter=',')
    totalCount = 0
    for row in parsedFile:
        target = row['new']
        source = row['old']
        systemLine = "some_curl_command {source}, {target}".format(source = source, target = target)
        os.system(systemLine)
        totalCount += 1
        print "\nProcessed number: " + str(totalCount)

I'm not sure how to optimize this script. Should I use something besides DictReader?

I have to use Python 2.7, and cannot upgrade to Python 3.

alexgbelov
  • The problem does not lie in how you're reading the CSV, but in shelling out to `curl` for each row of the file. Instead: 1. use native Python code to retrieve the URL, and 2. use multithreading to make multiple requests at once. – kindall Jul 25 '17 at 19:10
  • Is there anything else I can do? I'm new to python, and I don't want to start messing about with multithreading. – alexgbelov Jul 25 '17 at 19:31
  • No. 99% of your script's runtime is spent waiting on the web requests, because you wait for each one to complete before starting the next. To avoid this, you have to run more than one at a time. – kindall Jul 26 '17 at 00:23
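
To make that suggestion concrete, here is a minimal sketch of the thread-pool approach kindall describes, using `multiprocessing.dummy.Pool` (a thread pool from the standard library) together with `requests`. It assumes, as the answers below do, that the 'old' column holds a URL and the 'new' column a local output path:

    import csv
    import sys
    import requests
    from multiprocessing.dummy import Pool   # thread pool; fine for I/O-bound work

    def download(row):
        # Fetch the URL in the 'old' column and save it to the path in 'new'
        # (column meanings assumed from the question; no error handling).
        response = requests.get(row['old'])
        with open(row['new'], 'wb') as out:
            out.write(response.content)

    with open(sys.argv[1], 'r') as inputFile:
        rows = list(csv.DictReader(inputFile, delimiter=','))

    pool = Pool(10)   # 10 downloads in flight at once; tune to taste
    for count, _ in enumerate(pool.imap_unordered(download, rows), 1):
        print "Processed number: " + str(count)
    pool.close()
    pool.join()

`multiprocessing.dummy` exposes the same Pool API backed by threads, which is enough here because the work is network-bound rather than CPU-bound.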

2 Answers

  1. If you want to avoid multiprocessing, you can split your long CSV file into a few smaller CSVs and run your script on each of them simultaneously, like:

    $ python your_script.py 1.csv &
    $ python your_script.py 2.csv & 
    

The ampersand runs the command in the background on Linux. More details here. I don't know of anything equivalent on Windows, but you could open a few cmd windows and run one copy in each.
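
A minimal sketch of the splitting step, assuming the first row of the file is a header (the chunk size of 25000 and the output names 1.csv, 2.csv, ... are arbitrary choices):

    import csv
    import sys

    chunkSize = 25000                          # rows per output file; arbitrary
    with open(sys.argv[1], 'rb') as inputFile:
        reader = csv.reader(inputFile)
        header = next(reader)                  # repeat the header in every chunk
        part, chunk = 1, []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunkSize:
                with open('%d.csv' % part, 'wb') as out:
                    csv.writer(out).writerows([header] + chunk)
                part, chunk = part + 1, []
        if chunk:                              # whatever rows are left over
            with open('%d.csv' % part, 'wb') as out:
                csv.writer(out).writerows([header] + chunk)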

Anyway, it's much better to stick with multiprocessing, of course.
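
If you do go the multiprocessing route, a rough sketch might look like this, still shelling out to curl exactly as the question does (the pool size of 8 is an assumption):

    import csv
    import os
    import sys
    from multiprocessing import Pool

    def handle(row):
        # Same per-row work as in the question, run in a worker process.
        os.system("some_curl_command {source}, {target}".format(
            source=row['old'], target=row['new']))

    if __name__ == '__main__':
        with open(sys.argv[1], 'r') as inputFile:
            rows = list(csv.DictReader(inputFile, delimiter=','))
        pool = Pool(8)                         # 8 worker processes; tune to taste
        pool.map(handle, rows)
        pool.close()
        pool.join()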

  2. What about using requests instead of curl?

    import requests
    response = requests.get(source_url)
    html = response.content               # raw bytes of the response body
    with open(target, "wb") as file:      # binary mode, since content is bytes
        file.write(html)
    

Here's the doc.

  3. Avoid print statements; over a long run they are slow as hell. They're fine for development and debugging, but once you start the final run of your script you can remove them and check the count of processed files directly in the target folder.
juggernaut

Running

    subprocess.Popen(systemLine)

instead of

    os.system(systemLine)

should speed things up, because `Popen` starts the command and returns immediately instead of waiting for it to finish the way `os.system` does, so the downloads end up running concurrently. Please note that `systemLine` has to be a list of strings, e.g. `['some_curl_command', source, target]`, in order to work. If you want to limit the number of concurrent commands, have a look at that.
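
For example, one crude way to cap concurrency (a sketch; the limit of 8 is an assumption, and the command list mirrors the question's placeholder curl call) is to launch the commands in batches and wait for each batch to finish before starting the next:

    import csv
    import subprocess
    import sys

    limit = 8                                  # max concurrent curl processes
    running = []
    with open(sys.argv[1], 'r') as inputFile:
        for row in csv.DictReader(inputFile, delimiter=','):
            cmd = ['some_curl_command', row['old'], row['new']]
            running.append(subprocess.Popen(cmd))
            if len(running) >= limit:          # batch is full: wait for all of
                for proc in running:           # them before launching more
                    proc.wait()
                running = []
    for proc in running:                       # wait for the final partial batch
        proc.wait()

This batching is cruder than a real worker pool, but it keeps the code close to the original script.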

f1nan