
I have around 900 pages and each page contains 10 buttons (each button links to a PDF). I want to download all the PDFs - the program should browse through all the pages and download the PDFs one by one.

My code below only matches hrefs ending in .pdf, but the hrefs on this site do not end in .pdf. The pages run from page_no 1 to 900, for example:

https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3

That is the website, and the link text on each button looks like this:

BID NO: GEM/2021/B/1804626

Here is my current code:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

#If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)
  • Good answers require good questions, please help make your problem comprehensible to all by improving your question --> What do your previous attempts look like and where are you not getting any further? Thanks – HedgeHog Dec 30 '21 at 11:29
  • Edited, pls check @HedgeHog – Deepak Jain Dec 30 '21 at 11:31
  • Is your site only available in India? – Sergey K Dec 30 '21 at 11:35
  • No really new information - Check again: What do your previous attempts look like and where are you not getting any further? – HedgeHog Dec 30 '21 at 11:35
  • @SergeyK you can use a proxy server website – Deepak Jain Dec 30 '21 at 11:54
  • @HedgeHog I don't know how to download the files which are present inside a class – Deepak Jain Dec 30 '21 at 11:56
  • If you have no idea at all how it could work, [this post](https://stackoverflow.com/questions/54616638/download-all-pdf-files-from-a-website-using-python) might help you - The point is simply that we are happy to help you with a particular problem you're stuck with, but we can't knit a ready-made solution without your previous attempts. – HedgeHog Dec 30 '21 at 12:25
  • @HedgeHog code edited – Deepak Jain Dec 30 '21 at 12:38
  • Sorry, this is just a copy of the accepted answer from the link I posted - **This behavior is not okay and does not show any effort** - I am out – HedgeHog Dec 30 '21 at 12:46
  • "need answer as soon as possible" - then pay with dollars, not rep. I don't even understand how this can be used as a valid bounty message. And now that I see the OP is showing no effort at all, I don't understand the upvotes. – diggusbickus Jan 07 '22 at 19:46

2 Answers


You only need the href attribute of the links you call buttons, then prefix it with the appropriate protocol + domain.

The links can be matched with the following selector:

.bid_no > a

That is, anchor (a) tags whose direct parent element has the class bid_no.

This should pick up 10 links per page. As you will need a file name for each download, I suggest a global dict in which you store the links as values and the link text as keys. I would replace the "/" in the link descriptions with "_" so the bid numbers are valid file names. You simply add to this dict during your loop over the desired number of pages.

An example of some of the dictionary entries:

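(For illustration only - the key is the bid number with "/" replaced by "_", the value is the absolute link to the document; the /showbidDocument/ path shown here is a placeholder, not a real link.)

pdf_links = {
    'GEM_2021_B_1804626': 'https://bidplus.gem.gov.in/showbidDocument/2993132',
}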


As there are over 800 pages, I have chosen to add an additional termination page count variable called end_number. I don't want to loop over all the pages, so this allows me an early exit. You can remove this variable (and the check that uses it) if so desired.

Next, you need to determine the actual number of pages. For this you can use the following CSS selector to get the last pagination link, then extract its data-ci-pagination-page value and convert it to an integer. This becomes num_pages (the number of pages) at which to terminate your loop:

.pagination li:last-of-type > a

That looks for an a tag which is a direct child of the last li element, where those li elements share a parent with class pagination, i.e. the anchor tag in the last li, which is the last page link in the pagination element.

Once you have all your desired links and file suffixes (the description text for the links) in your dictionary, loop the key, value pairs and issue requests for the content. Write that content out to disk.


TODO:

I would suggest you look at ways of optimizing the final issuing of requests and writing out to disk. For example, you could first issue all requests asynchronously and store the responses in a dictionary, to optimize what is an I/O-bound process. Then loop over that dictionary writing to disk, perhaps with a multi-processing approach, to optimize the more CPU-bound process.

I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could theoretically have something like (836 * 10) + 836 requests at present.
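
A minimal sketch of that idea, assuming the pdf_links dict and path variable built by the code below; ThreadPoolExecutor, the worker count and the delay are illustrative choices, not values tested against this site:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

def download(item):
    # item is one (file name, link) pair from pdf_links.items()
    name, link = item
    r = requests.get(link)
    with open(f'{path}/{name}.pdf', 'wb') as f:
        f.write(r.content)
    time.sleep(0.5)  # crude per-download politeness delay

# cap concurrency so the site is not hit with hundreds of simultaneous requests
with ThreadPoolExecutor(max_workers=5) as executor:
    list(executor.map(download, pdf_links.items()))  # list() forces evaluation so errors surface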


import requests
from bs4 import BeautifulSoup as bs

end_number = 3  # early-exit page count; remove the second half of the break check to crawl every page
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')
        # keys are the bid numbers (with "/" swapped for "_"), values are the absolute links
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        #print(pdf_links)
        if current_page == 1:
            # the total page count comes from the last pagination link
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)
        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1

    # download each document while the session is still open
    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)
QHarr
  • For some reason all PDFs are not downloading... – Deepak Jain Jan 03 '22 at 16:27
  • Did you note my comment about the `or current_page > end_number`? I put that in to stop at 3 pages. Did you remove that bit of code to get all results or set end_number higher? – QHarr Jan 03 '22 at 17:45
  • I did not comment it out, but I have put 822 in end_number. It should give 8220 PDFs but it's only getting 4412 records. I have also checked that it's going to the last page and picking up the last page's records, but it's not picking up some PDFs on some pages between the first and last page – Deepak Jain Jan 04 '22 at 05:15
  • Is it the same PDFs missed each time or different PDFs? Did you check the length of `pdf_links`? – QHarr Jan 08 '22 at 02:38
  • pdf_links len is coming as 4416 – Deepak Jain Jan 08 '22 at 15:14
  • And the value of `num_pages`? – QHarr Jan 08 '22 at 15:29
  • 893 is the value of num_pages @QHarr – Deepak Jain Jan 08 '22 at 16:24
  • Can you narrow down which pages are missing entries by printing `print(len(soup.select('.bid_no > a')))` along with `current_page` during the loop? Be sure to print before looping over the `a` tags. Or better still, create a dictionary of current_page:len(bids) and then look at what the various lengths are by filtering on values <10. – QHarr Jan 08 '22 at 17:49
  • I have put print(len(soup.select('.bid_no > a'))) and print(current_page) after #print(pdf_links); it is printing all pages and 10 – Deepak Jain Jan 09 '22 at 06:11
  • If the len is 10 throughout and it loops all pages, I don't see how you can have 4416 from 893 pages unless some bids are repeated? – QHarr Jan 09 '22 at 11:32
  • Create a separate global list variable called `links`. Append to that list after the current line: `pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']` i.e. so you have `links.append('https://bidplus.gem.gov.in' + i['href'])` inside that loop as a line underneath. Then at the end check the `len(links)` and then `len(set(links))` to see if there are duplicates. – QHarr Jan 09 '22 at 11:36
  • @QHarr len(set(links)) is coming as 624 and len(links) is 801 – Deepak Jain Jan 09 '22 at 13:20
  • @QHarr len(set(links)) is coming as 624 and len(links) is 801; pdf_links is coming as 3707 – Deepak Jain Jan 09 '22 at 13:26
  • So, there are duplicates, which wouldn't appear in the dictionary as separate entries. They would overwrite. I would still expect, however, len() or len(set()) to match the len of the pdf_links items. I wonder if the text can be duplicated as well across different keys. If you only extract the bid number and add that instead to the `links` list (instead of the actual link itself), what is the difference between len() and len(set())? – QHarr Jan 09 '22 at 13:58
  • @QHarr any idea how to get all records (duplicates included)? – Deepak Jain Jan 09 '22 at 15:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/240893/discussion-between-qharr-and-akhil-jain). – QHarr Jan 09 '22 at 15:20
  • len(pdf_descs) and len(pdf_links) are coming as 8728, but still only 4446 PDFs are downloaded with the code you pasted in chat @QHarr – Deepak Jain Jan 09 '22 at 17:55
  • @QHarr is there any way to get all records, duplicate or not? – Deepak Jain Jan 16 '22 at 06:47

Your site doesn't work for 90% of people, but you provided examples of the HTML, so I hope this will help you:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://bidplus.gem.gov.in/bidlists'
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        # give each file a unique name here, otherwise every download overwrites the last
        with open('pdf_name_here.pdf', 'wb') as f:
            # if the href is already a full link
            href = pdf.get('href')
            # if the href is only a path, like /showbidDocument/2993132
            #href = urljoin(url, pdf.get('href'))
            response = requests.get(href)
            f.write(response.content)
Sergey K