I'm pulling data from a REST API. Because the data set is huge, the response is paginated. I've worked around that by first reading how many pages of data there are and then issuing one request per page. The problem is that there are around 1.5K pages, which takes a huge amount of time to fetch and append to a CSV. Is there a faster workaround for this?
This is the endpoint I'm targeting: https://developer.keeptruckin.com/reference#get-logs
import csv

import requests

url = 'https://api.keeptruckin.com/v1/logs?start_date=2019-03-09'
header = {'x-api-key': 'API KEY HERE'}

# First request: read the pagination info so we know how many pages to fetch.
r = requests.get(url, headers=header)
result = r.json()  # r.json() already parses the body; no separate json.loads needed
num_pages = result['pagination']['total']
print(num_pages)

csvheader = ['First Name', 'Last Name', 'Date', 'Time', 'Type', 'Location']
usernames = {'barmx1045', 'aposx001', 'mcqkl002', 'coudx014', 'ruscx013',
             'loumx001', 'robkr002', 'masgx009', 'coxed001', 'mcamx009',
             'linmx024', 'woldj002', 'fosbl004'}

# Open the file once instead of reopening it for every page.
with open('myfile.csv', 'a+', newline='') as csvfile:
    writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    ##writer.writerow(csvheader)
    for page in range(1, num_pages + 1):  # start at 1 so the first page gets written too
        r = requests.get(url, headers=header, params={'page_no': page})
        result = r.json()
        for log in result['logs']:
            username = log['log']['driver']['username']
            first_name = log['log']['driver']['first_name']
            last_name = log['log']['driver']['last_name']
            for event in log['log']['events']:
                start_time = event['event']['start_time']
                date, time = start_time.split('T')
                event_type = event['event']['type']
                location = event['event']['location'] or 'N/A'
                if username in usernames:
                    writer.writerow((first_name, last_name, date, time, event_type, location))
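One idea I've been toying with is fetching the pages concurrently, since each page request is independent. Below is a minimal sketch using concurrent.futures.ThreadPoolExecutor from the standard library. The fetch_page helper and the worker count are my own inventions, and I'm assuming the API tolerates parallel requests under one key and that page_no behaves the same as above; I haven't confirmed either against the rate limits. Would this be the right direction, or is there a better approach?

    import csv
    from concurrent.futures import ThreadPoolExecutor

    import requests

    url = 'https://api.keeptruckin.com/v1/logs?start_date=2019-03-09'
    header = {'x-api-key': 'API KEY HERE'}

    def fetch_page(page):
        # Hypothetical helper: one GET per page, same params as the loop above.
        r = requests.get(url, headers=header, params={'page_no': page})
        r.raise_for_status()
        return r.json()

    num_pages = requests.get(url, headers=header).json()['pagination']['total']

    # 10 workers is a guess; the right number depends on the API's rate limits.
    with ThreadPoolExecutor(max_workers=10) as executor:
        # executor.map yields results in page order, so the CSV rows stay ordered.
        for result in executor.map(fetch_page, range(1, num_pages + 1)):
            for log in result['logs']:
                pass  # same row-building/writing logic as in the sequential version

Separately, if the endpoint supports requesting more records per page (I haven't checked whether a per_page parameter exists for this endpoint), that would cut the number of requests directly.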