
I am writing a little script which loops through a .csv file, stores each row as a dictionary, and fires that dictionary off to an API in a one-element list.

import csv
import requests

with open('csv.csv', 'rU') as f:
    reader = csv.reader(f, skipinitialspace=True)
    header = next(reader)
    for row in reader:
        # Wrap the row dict in a one-element list, since the API expects a list
        request = [dict(zip(header, map(str, row)))]
        r = requests.post(url, headers=i_headers, json=request)
        print str(reader.line_num) + "-" + str(r)

The request list looks something like this:

[
    {
        "id": "1", 
        "col_1": "A",
        "col_2": "B",
        "col_3": "C"
    }
]

This script works, but I'm looping through an 8-million-row .csv, and this method is simply too slow. I would like to speed up the process by sending more than one row per API call. The API I'm working with allows me to send up to 100 rows per call.

How can I change this script so that it incrementally builds lists of 100 dictionaries, posts each list to the API, and then repeats? A sample of what I'd be sending to this API would look like this:

[
    {
        "id": "1", 
        "col_1": "A",
        "col_2": "B",
        "col_3": "C"
    },
    {
        "id": "2", 
        "col_1": "A",
        "col_2": "B",
        "col_3": "C"
    },
...
...
...
    {
        "id": "100", 
        "col_1": "A",
        "col_2": "B",
        "col_3": "C"
    }
]

One thing that won't work is building a massive list and then partitioning it into n lists of size 100, because my machine cannot hold all of that data in memory at any given time.

sumojelly

2 Answers


You can accumulate the rows in a list and, whenever it reaches 100 entries, send it to the API:

import csv
import requests

with open('csv.csv', 'rU') as f:
    reader = csv.reader(f, skipinitialspace=True)
    header = next(reader)
    requestList = []
    for row in reader:
        requestList.append(dict(zip(header, map(str, row))))
        if len(requestList) >= 100:
            # Batch is full: post it and start a new one
            r = requests.post(url, headers=i_headers, json=requestList)
            print str(reader.line_num) + "-" + str(r)
            requestList = []

Then you just need to take care that you also call the API for the last, non-full list. This can be done either by posting the remaining list after the loop, or by having the CSV reader tell you when you've reached the last row, as in the sketch below.
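
For the first option, a minimal sketch of that final flush (sitting inside the same with block and reusing the requestList, url and i_headers from above) would be:

    # After the for loop: post whatever is left over (fewer than 100 rows)
    if requestList:
        r = requests.post(url, headers=i_headers, json=requestList)
        print str(reader.line_num) + "-" + str(r)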


It is possible to do this by using range(100) and except StopIteration:, but it's not very pretty. Instead, a generator is perfect for getting chunks of 100 rows at a time from your CSV file. As it doesn't clutter up your actual iteration and request logic, it makes for fairly elegant code. Check it:

import csv
import requests
from itertools import islice

def chunks(iterator, size):
    """Yield successive tuples of up to `size` items from `iterator`."""
    iterator = iter(iterator)
    chunk = tuple(islice(iterator, size))
    while chunk:
        yield chunk
        chunk = tuple(islice(iterator, size))

with open('csv.csv', 'rU') as f:
    reader = csv.reader(f, skipinitialspace=True)
    header = next(reader)
    for rows in chunks(reader, 100):
        rows = [dict(zip(header, map(str, row))) for row in rows]
        r = requests.post(url, headers=i_headers, json=rows)
        print str(reader.line_num) + "-" + str(r)
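
If you want to sanity-check the chunking on its own, a quick interpreter session (purely illustrative) shows how it groups items and keeps the final short chunk:

>>> list(chunks(range(7), 3))
[(0, 1, 2), (3, 4, 5), (6,)]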

I'm not entirely sure where you're getting i_headers from, but I assume you've got that figured out in your actual code.
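
In case it helps: url and i_headers would just be your endpoint and whatever headers your API expects. The values below are purely illustrative placeholders, since the real ones depend on your API:

url = 'https://api.example.com/bulk'  # hypothetical endpoint
i_headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <your-api-token>',  # placeholder credential
}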

obskyr