
Below is my code, which writes the data row by row (there are around 900 pages, each with 10 rows and 5 fields per row). Is there any way to make this faster? Currently it takes about 80 minutes to export the data to CSV. Is there a way to make parallel requests to the pages and make this code more efficient?

import requests
from urllib3.exceptions import InsecureRequestWarning
import csv

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs

f = csv.writer(open('GEM.csv', 'w', newline=''))
f.writerow(['Bidnumber', 'Items', 'Quantity', 'Department', 'Enddate'])


def scrap_bid_data():
    page_no = 1
    while page_no < 910:
        print('Hold on creating URL to fetch data...')
        url = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)
        print('URL created: ' + url)
        scraped_data = requests.get(url, verify=False)
        soup_data = bs(scraped_data.text, 'lxml')
        extracted_data = soup_data.find('div', {'id': 'pagi_content'})
        if len(extracted_data) == 0:
            break
        else:
            # every other child of the listings div is a bid block; parse its text lines
            for idx in range(len(extracted_data)):
                if (idx % 2 == 1):
                    bid_data = extracted_data.contents[idx].text.strip().split('\n')

                    bidno = bid_data[0].split(":")[-1]
                    items = bid_data[5].split(":")[-1]
                    qnty = int(bid_data[6].split(':')[1].strip())
                    dept = (bid_data[10] + bid_data[12].strip()).split(":")[-1]
                    edate = bid_data[17].split("End Date:")[-1]
                    f.writerow([bidno, items, qnty, dept, edate])

            page_no = page_no + 1
scrap_bid_data()
Deepak Jain
  • Try to use pandas with a dictionary: https://stackoverflow.com/questions/57000903/what-is-the-fastest-and-most-efficient-way-to-append-rows-to-a-dataframe/57001947#57001947 – Zaraki Kenpachi Sep 11 '20 at 06:02
  • This code is IO-bound - most time will be spent making HTTP requests - so the way to optimise it is to parallelise the HTTP requests using threads or similar. – snakecharmerb Sep 11 '20 at 07:34
  • @snakecharmerb sorry, I have no idea about threads, could you please help! – Deepak Jain Sep 11 '20 at 07:35
  • As an easy introduction to threads in python I would recommend this article: https://medium.com/better-programming/every-python-programmer-should-know-the-not-so-secret-threadpool-642ec47f2000 – Ottotos Sep 11 '20 at 08:05
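
For anyone (like the asker) who has not used threads before, here is a minimal, generic sketch of the thread-pool pattern suggested in the comments above. The fetch function and the example.com URL are placeholders for illustration only, not part of the question's site:

import concurrent.futures
import requests

def fetch(page_no):
    # placeholder fetch function; for this question it would build the bidlists URL instead
    url = 'https://example.com/page/' + str(page_no)
    return requests.get(url, timeout=30).text

# run up to 10 downloads concurrently; map() yields results in submission order
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    for page_text in executor.map(fetch, range(1, 11)):
        print(len(page_text))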

1 Answer


I've restructured your code a bit to ensure that your CSV file is closed. I also got the following error message:

ConnectionError: HTTPSConnectionPool(host='bidplus.gem.gov.in', port=443): Max retries exceeded with url: /bidlists?bidlists&page_no=1 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000012EB0DF1E80>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))

You should experiment with the NUMBER_THREADS value:

import requests
from urllib3.exceptions import InsecureRequestWarning
import csv
import concurrent.futures
import functools

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs


def download_page(session, page_no):
    url = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)
    print('URL created: ' + url)
    resp = session.get(url, verify=False)
    return resp.text


def scrap_bid_data():
    NUMBER_THREADS = 30 # number of concurrent download requests
    with open('GEM.csv', 'w', newline='') as out_file:
        f = csv.writer(out_file)
        f.writerow(['Bidnumber', 'Items', 'Quantity', 'Department', 'Enddate'])
        with requests.Session() as session:
            page_downloader = functools.partial(download_page, session)  # bind the shared session so map() only needs to pass page numbers
            with concurrent.futures.ThreadPoolExecutor(max_workers=NUMBER_THREADS) as executor:
                pages = executor.map(page_downloader, range(1, 910))  # results are yielded in submission order (pages 1..909)
                page_no = 0
                for page in pages:
                    page_no += 1
                    soup_data = bs(page, 'lxml')
                    extracted_data = soup_data.find('div', {'id': 'pagi_content'})
                    if extracted_data is None or len(extracted_data) == 0:
                        print('No data at page number', page_no)
                        print(page)
                        break
                    else:
                        for idx in range(len(extracted_data)):
                            if (idx % 2 == 1):
                                bid_data = extracted_data.contents[idx].text.strip().split('\n')

                                bidno = bid_data[0].split(":")[-1]
                                items = bid_data[5].split(":")[-1]
                                qnty = int(bid_data[6].split(':')[1].strip())
                                dept = (bid_data[10] + bid_data[12].strip()).split(":")[-1]
                                edate = bid_data[17].split("End Date:")[-1]
                                f.writerow([bidno, items, qnty, dept, edate])
scrap_bid_data()
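
Not shown in the code above, and only a guess on my part given the connection errors I ran into, but you could also give the shared session automatic retries and pass an explicit timeout in download_page, for example:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # hypothetical helper: retry transient failures a few times with backoff
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount('https://', HTTPAdapter(max_retries=retries))
    return session

# and inside download_page, e.g.:
#     resp = session.get(url, verify=False, timeout=30)
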
Booboo
  • With 30 threads it has written 700 records out of 8770 in Excel @Booboo – Deepak Jain Sep 11 '20 at 13:33
  • What is the significance of that? – Booboo Sep 11 '20 at 13:47
  • Are there any exceptions? I couldn't even retrieve one URL. You have `if len(extracted_data) == 0: break`. I will modify the code to keep track of the number of pages actually processed when this is executed. – Booboo Sep 11 '20 at 13:58
  • I've updated the code to explicitly count the number of pages processed. Your original code only retrieves, I believe, pages 1 through 909, so that's what mine does. – Booboo Sep 11 '20 at 14:02
  • And why do I get "This site can’t be reached." errors with a browser and timeout errors with `requests` and you don't? – Booboo Sep 11 '20 at 14:07
  • use vpn with Indian server – Deepak Jain Sep 11 '20 at 14:07
  • With range(1, 100) and max_thread=30 it worked, 100% of the records were written, but if the range is (1, 500) with max_thread it writes only 20-25% of the records – Deepak Jain Sep 11 '20 at 14:35
  • `range(1, n)` causes it to retrieve pages page_no = 1, page_no = 2 ... page_no = n - 1 into a list. Your code, however, breaks immediately as soon as it finds the first page returned where `len(extracted_data) == 0`, which could be long before page_no n - 1 has been processed in the loop. I have no control over your program logic. NUMBER_THREADS should only affect how quickly the n - 1 pages are downloaded. The code should now print: 'No data at page number', page_no. What number prints out? – Booboo Sep 11 '20 at 14:44
  • For range(1, 200) the code runs for 199 pages, writing 1500 out of 1900 records, and then I get this exception: Traceback (most recent call last): File "C:/Users/deepak jain/PycharmProjects/SeleniumTest/DEMO/Sample1.py", line 46, in scrap_bid_data() File "C:/Users/deepak jain/PycharmProjects/SeleniumTest/DEMO/Sample1.py", line 32, in scrap_bid_data if len(extracted_data) == 0: TypeError: object of type 'NoneType' has no len() – Deepak Jain Sep 11 '20 at 15:00
  • even though pages 199 and 200 exist and contain data – Deepak Jain Sep 11 '20 at 15:00
  • And your original code does not get that error? It looks like your `extracted_data = soup_data.find('div', {'id': 'pagi_content'})` returned no element at all. I have no access to this website so there is really very little I can do to debug your code. If you do not think that my code is downloading the pages correctly, then don't use it. If you think your coding logic could use some improving, then improve it. I suggest that you also test for `None` being returned and, on either condition of `None` or length 0, print out the contents of the page to see what it is. I will update the code. – Booboo Sep 11 '20 at 15:10
  • Can't you use a VPN? @Booboo – Deepak Jain Sep 11 '20 at 15:16
  • No, I can't at the present time. – Booboo Sep 11 '20 at 15:17
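
Following up on the suggestion in the comments above, and not part of the posted answer: if the goal is to keep writing the remaining pages even when one page comes back empty or malformed, the loop in the answer could skip the bad page instead of breaking out. Only the break changes; the parsing block stays exactly as in the answer (the `...` marks it):

                for page in pages:
                    page_no += 1
                    soup_data = bs(page, 'lxml')
                    extracted_data = soup_data.find('div', {'id': 'pagi_content'})
                    if extracted_data is None or len(extracted_data) == 0:
                        # log the problem page and move on rather than aborting the whole run
                        print('No data at page number', page_no)
                        print(page)
                        continue
                    ...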