
I am trying to open multiple web sessions and save the data into CSV files. I have written my code using a for loop and requests.get, but it takes very long to access the 90 web locations. Can anyone let me know how to run the whole process in parallel for loc_var?

The code works fine; the only issue is that it runs one loc_var at a time, which takes a long time.

I want to access all the loc_var URLs from the for loop in parallel and write the CSV files.

Below is the code:

import pandas as pd
import numpy as np
import os
import requests
import datetime
import zipfile
t=datetime.date.today()-datetime.timedelta(2)
server = [("A","web1",":5000","username=usr&password=p7Tdfr")]
'''List of all web_ips'''
web_1 = ["Web1","Web2","Web3","Web4","Web5","Web6","Web7","Web8","Web9","Web10","Web11","Web12","Web13","Web14","Web15"]
'''List of All location'''
loc_var =["post1","post2","post3","post4","post5","post6","post7","post8","post9","post10","post11","post12","post13","post14","post15","post16","post17","post18"]

for s, web, port, usr in server:
    login_url = 'http://' + web + port + '/api/v1/system/login/?' + usr
    print(login_url)
    sess = requests.session()
    login_response = sess.post(login_url)
    print("login response", login_response)
    # Start accessing the web for each loc_var entry
    for mkt in loc_var:
        # Output is a CSV file
        com_actions_url = 'http://' + web + port + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22'
        print("com_action_url", com_actions_url)
        r = sess.get(com_actions_url)
        print("action", r)
        if r.ok:
            with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                f.write(r.content)

        # If the location is not accessible, retry with the servers in the web_1 list
        if not r.ok:
            while not r.ok:
                for web_2 in web_1:
                    login_url = 'http://' + web_2 + port + '/api/v1/system/login/?' + usr
                    com_actions_url = 'http://' + web_2 + port + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22'
                    login_response = sess.post(login_url)
                    print("login response", login_response)
                    print("com_action_url", com_actions_url)
                    r = sess.get(com_actions_url)
                    if r.ok:
                        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                            f.write(r.content)
                        break
                        break
DHANANJAY CHAUBEY

1 Answer


There are multiple approaches you can take to make concurrent HTTP requests. Two that I've used are (1) multiple threads with concurrent.futures.ThreadPoolExecutor, or (2) sending the requests asynchronously using asyncio/aiohttp.

To use a thread pool to send your requests in parallel, you would first generate the list of URLs you want to fetch (in your case, the login_urls and com_action_urls), and then request all of them concurrently as follows:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    # Catch HTTP errors/exceptions here
    page = requests.get(url)
    return page.text

pool = ThreadPoolExecutor(max_workers=5)

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']  # Create a list of urls

for page in pool.map(fetch, urls):
    # Do whatever you want with the results ...
    print(page[0:100])
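
Applied to your loop, a rough sketch could look like the following. It is only illustrative: it assumes the web, port, usr, t, and loc_var values and the URL pattern from your question, and the fetch_one helper is hypothetical. Each worker creates its own requests session because a shared Session is not guaranteed to be thread-safe.

from concurrent.futures import ThreadPoolExecutor
import os
import requests

def fetch_one(mkt):
    # Hypothetical helper: log in, fetch one location's CSV, write it to disk.
    # web, port, usr and t are assumed to come from your existing script.
    sess = requests.session()
    sess.post('http://' + web + port + '/api/v1/system/login/?' + usr)
    url = ('http://' + web + port + '/api/v1/3E+date(%5C%22' + str(t) +
           '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22')
    r = sess.get(url)
    if r.ok:
        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
            f.write(r.content)
    return mkt, r.status_code

with ThreadPoolExecutor(max_workers=5) as pool:
    for mkt, status in pool.map(fetch_one, loc_var):
        print(mkt, status)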

Using asyncio/aiohttp is generally faster than the threaded approach above, but the learning curve is steeper. Here is a simple example (Python 3.7+):

import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']

async def fetch(session, url):
    # Catch HTTP errors/exceptions here
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_concurrent(urls):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            tasks.append(asyncio.create_task(fetch(session, u)))

        for result in asyncio.as_completed(tasks):
            page = await result
            #Do whatever you want with results
            print(page[0:100])

asyncio.run(fetch_concurrent(urls))
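
If you do go the asyncio route for a case like yours, the same idea applies: have the fetch coroutine return the raw bytes (resp.read()) and write each location's CSV as the tasks complete. The sketch below is only illustrative; it assumes the web, port, usr, t, and loc_var values and the URL pattern from your question, and that a single login sets a session cookie that the later requests reuse.

import asyncio
import os
import aiohttp

async def fetch_csv(session, mkt):
    # Assumed URL pattern, web/port/usr/t values, and output path from the question
    url = ('http://' + web + port + '/api/v1/3E+date(%5C%22' + str(t) +
           '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22')
    async with session.get(url) as resp:
        return mkt, resp.status, await resp.read()

async def fetch_all(markets):
    async with aiohttp.ClientSession() as session:
        # Log in once; the session's cookie jar is reused by the later requests
        async with session.post('http://' + web + port + '/api/v1/system/login/?' + usr) as login_resp:
            print("login response", login_resp.status)
        tasks = [asyncio.create_task(fetch_csv(session, mkt)) for mkt in markets]
        for coro in asyncio.as_completed(tasks):
            mkt, status, body = await coro
            if status == 200:
                # Plain file writes briefly block the event loop; fine for small CSVs
                with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                    f.write(body)

asyncio.run(fetch_all(loc_var))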

But unless you are going to be making a huge number of requests, the threaded approach will likely be sufficient (and way easier to implement).

J. Taylor
  • I know this is old, but why are you setting the max workers equal to 5? Would you mind sharing how we would find out what to set it at? Besides trial and error? Leaving the max workers blank improved performance from 140s to 30s – SCCJS Sep 28 '21 at 16:53
  • It has been a long time since I was doing any of this, but I would guess it was based on not wanting to send out requests too quickly? (at the time I was writing a lot of scripts to scrape data from websites, and needed to limit the # of requests to prevent being blocked) ... but yeah, you're probably right that in general there wouldn't always be a reason to limit the # of concurrent requests, and that it would be faster without the limit – J. Taylor Sep 29 '21 at 20:06
  • [link](https://stackoverflow.com/questions/49005651/how-does-asyncio-actually-work/51116910#51116910) To anyone else who finds this: the full explanation turned out to be extremely long and complicated. Read the link above if you would like to know more. TL;DR: I believe it is best not to specify max_workers unless you have an advanced implementation that requires it – SCCJS Oct 11 '21 at 21:09