
I want to fetch data (JSON files only) from multiple URLs using requests.get(). The URLs are saved in a pandas DataFrame column, and I am saving the responses to local JSON files.

import json
import requests
from time import time

i = 0
start = time()
for url in pd_url['URL']:
    # fetch one URL at a time and parse the response body as JSON
    r_1 = requests.get(url, headers=headers).json()
    filename = './jsons1/' + str(i) + '.json'
    with open(filename, 'w') as f:
        json.dump(r_1, f)
    i += 1

time_taken = time() - start
print('time taken:', time_taken)


Currently, I fetch the data from each URL one by one using the for loop shown above, but it takes too long to execute. Is there any way to send multiple requests at once and make this run faster?

Also, what are the possible factors that are delaying the responses?
I have an internet connection with low latency and enough bandwidth to 'theoretically' complete the above operation in under 20 seconds. Still, the code above takes 145-150 seconds every time I run it. My target is to finish the execution in at most 30 seconds. Please suggest workarounds.


2 Answers


It sounds like you want multi-threading, so use ThreadPoolExecutor from the concurrent.futures module in the standard library.

import concurrent.futures
import json

import requests

def make_request(url, headers):
    # worker function: fetch one URL and parse the body as JSON
    resp = requests.get(url, headers=headers).json()
    return resp

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = (executor.submit(make_request, url, headers) for url in pd_url['URL'])
    for idx, future in enumerate(concurrent.futures.as_completed(futures)):
        try:
            data = future.result()
        except Exception as exc:
            print(f"Generated an exception: {exc}")
            continue  # skip writing a file for failed requests

        # note: idx reflects completion order, not the URL's position
        with open(f"./jsons1/{idx}.json", 'w') as f:
            json.dump(data, f)

You can increase or decrease the number of threads, specified as max_workers, as you see fit.
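
One caveat: as_completed yields futures in completion order, so the idx above will not generally match a URL's position in the DataFrame. If you need each file number to correspond to its URL's row, you can map each future back to its submission index. A minimal sketch, reusing the make_request helper, headers, and pd_url from above:

import concurrent.futures
import json

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map each future to the index of the URL it was submitted for
    future_to_idx = {
        executor.submit(make_request, url, headers): i
        for i, url in enumerate(pd_url['URL'])
    }
    for future in concurrent.futures.as_completed(future_to_idx):
        idx = future_to_idx[future]  # original position of the URL
        try:
            data = future.result()
        except Exception as exc:
            print(f"URL #{idx} generated an exception: {exc}")
            continue
        with open(f"./jsons1/{idx}.json", 'w') as f:
            json.dump(data, f)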

gold_cy

You can make use of multiple threads to parallelize your fetching. One way of doing that is with the ThreadPoolExecutor class from the concurrent.futures module.

It looks like @gold_cy posted pretty much the same answer while I was working on this, but for posterity, here's my example. I've adapted your code to use the executor, and tweaked it slightly so it runs locally, since I don't have handy access to a list of JSON URLs.

I'm using a list of 100 URLs: it takes about 125 seconds to fetch the list serially, and about 27 seconds using 10 workers. I added a timeout to the requests to prevent broken servers from holding everything up, and I added some code to handle error responses.

import pandas
import requests
import time

from concurrent.futures import ThreadPoolExecutor


def fetch_url(data):
    index, url = data
    print('fetching', url)
    try:
        # the timeout keeps a broken server from stalling the whole run
        r = requests.get(url, timeout=10)
    except requests.exceptions.Timeout:
        return

    if r.status_code != 200:
        return

    filename = f'./data/{index}.json'
    with open(filename, 'w') as f:
        # the response body is already JSON, so write it out verbatim
        f.write(r.text)


pd_url = pandas.read_csv('urls.csv')

start = time.time()
with ThreadPoolExecutor(max_workers=10) as runner:
    for _ in runner.map(fetch_url, enumerate(pd_url['URL'])):
        pass
    # the with block shuts the executor down on exit, so no explicit
    # runner.shutdown() call is needed

time_taken = time.time() - start
print('time taken:', time_taken)

Also, what are the possible factors that are delaying the responses?

The response time of the remote server is going to be the major bottleneck.
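
If you want to confirm that, requests records each response's latency in the Response.elapsed attribute, and reusing a single Session pools connections so you don't repeat the TCP/TLS handshake for every request to the same host. A rough sketch, with stand-in URLs substituted for your own list:

import requests

# stand-in URLs for illustration; substitute your own list
urls = ['https://httpbin.org/json', 'https://httpbin.org/uuid']

# one Session reuses connections across requests to the same host
with requests.Session() as s:
    for url in urls:
        r = s.get(url, timeout=10)
        # elapsed measures the time from sending the request until
        # the response headers arrive: mostly server/network latency
        print(url, r.elapsed.total_seconds(), 'seconds')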

larsks
  • There's one minor difference between our two approaches worth noting: your threads handle writing out the file as well, whereas my threads only perform the request and leave writing the file to the main thread where the program runs. I'm not sure of the performance impact, but I assume most of the I/O bottleneck is network latency. – gold_cy Apr 26 '20 at 12:49
  • I think writing after collecting, as you do, is probably the more performant solution, although, as you say, network latency was the major issue at least in my test, and I was trying to keep as much of the original code as possible. – larsks Apr 26 '20 at 12:51
  • Makes sense. I didn't time mine, but I wanted to let the OP know about the slight differences between the approaches. I think either answer will suit them better than their current process. – gold_cy Apr 26 '20 at 12:53