
I am trying to scrape data from the CCIL website. The table is split across multiple pages, but all of the pages use the same URL. I am using pandas and BeautifulSoup to parse the HTML, and I am able to scrape only the initial table, but I want the data from all of the pages.

Note that this website shows data as of a particular point in time.

My link is: https://www.ccilindia.com/OMMWSG.aspx

I have also seen a similar question on Stack Overflow; its program works, but I did not understand where the "data" part is taken from:

Scrape Tables on Multiple Pages with Single URL
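(For reference, the "data" in that kind of answer is typically the dictionary of hidden ASP.NET form fields scraped from the page itself and posted back to the same URL. Below is a minimal sketch of the idea, assuming OMMWSG.aspx is a standard ASP.NET WebForms page; the `hidden_fields` helper and the `"grdOMMWSG"` control name are my own illustrations, and the real `__EVENTTARGET` value has to be copied from the pager link's `__doPostBack(...)` call in the browser's developer tools.)

import requests
from bs4 import BeautifulSoup

URL = "https://www.ccilindia.com/OMMWSG.aspx"
session = requests.Session()

# First GET: the page embeds hidden state fields (__VIEWSTATE,
# __EVENTVALIDATION, ...) that must be echoed back in every POST --
# this is where the "data" dictionary in such answers comes from.
soup = BeautifulSoup(session.get(URL).text, "html.parser")

def hidden_fields(soup):
    # Collect every hidden <input> field from the current page.
    return {inp["name"]: inp.get("value", "")
            for inp in soup.select("input[type=hidden]")}

data = hidden_fields(soup)
# The pager links call __doPostBack(target, argument); copy the real
# target name from the page source. "grdOMMWSG" is a made-up example.
data["__EVENTTARGET"] = "grdOMMWSG"
data["__EVENTARGUMENT"] = "Page$2"   # ask the server for page 2

soup = BeautifulSoup(session.post(URL, data=data).text, "html.parser")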

Tech Bro
  • I'm voting to close this question because the Privacy Policy of the site states that "No part of the information on this website, including text and graphics, may be reproduced or transmitted in any form by any means without the express written consent of CCIL", which you are trying to violate. – yedpodtrzitko Sep 02 '21 at 10:38

1 Answer


I have written a simple Selenium script to scrape the table data and navigate through the pages.

from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path="<PATH TO YOUR CHROMEDRIVER>")
url = "https://www.ccilindia.com/OMMWSG.aspx"
driver.get(url)

time.sleep(2)

# This dictionary will hold all the data for each page.
row_info = {}

def next_page(page):
    # The pager links sit at the bottom of the table; each page's
    # link has a different index (a[1], a[2], ...) in the XPath.
    if page == 1:
        next_link = driver.find_element_by_xpath("/html/body/form/table[5]/tbody/tr[1]/td/table/tbody/tr[27]/td/a[1]")
    elif page == 2:
        print("Moving to last page")
        next_link = driver.find_element_by_xpath("/html/body/form/table[5]/tbody/tr[1]/td/table/tbody/tr[27]/td/a[2]")
    else:
        print("Last page reached, closing...")
        return None
    webdriver.ActionChains(driver).move_to_element(next_link).click().perform()


for page in range(1,4):
    print("Current page:", page)

    # After trial and error,
    # I found that these elements contain all the required data in a single page
    table_row = driver.find_elements_by_tag_name("tr")[5]
    td = table_row.find_elements_by_tag_name("td")[0].text

    # Creates a dictionary Key for current page and adds table data as Value
    row_info[f"page_{page}"] = td

    time.sleep(2)
    next_page(page)
    time.sleep(2)

print("---")
print(row_info["page_1"])
print("---")
print(row_info["page_2"])
print("---")
print(row_info["page_3"])

driver.quit()  # quit() ends the session and closes the browser; close() only closes the current window

The data saved to each dictionary entry is a single unformatted string, so for each page you will have something like this:

Security Description Maturity Date Bid Amt. (Cr.) Bid Yield Bid Price Offer Price Offer Yield Offer Amt. (Cr.) LTP LTY LTA TTA (Cr.)
08.26 MH SDL 2029 02/01/2029 0.00 0.0000 0.0000 0.0000 0.0000 0.00 109.0500 6.6761 5.00 5.00
08.57 HR SDL 2028 04/07/2028 0.00 0.0000 0.0000 0.0000 0.0000 0.00 110.3950 6.6501 5.00 5.00
08.35 GJ SDL 2029 06/03/2029 0.00 0.0000 0.0000 0.0000 0.0000 0.00 109.7000 6.6856 5.00 5.00
08.37 TN SDL 2029 06/03/2029 0.00 0.0000 0.0000 0.0000 0.0000 0.00 110.0500 6.6479 5.00 5.00
08.38 GJ SDL 2029 27/02/2029 0.00 0.0000 0.0000 0.0000 0.0000 0.00 109.8500 6.6853 5.00 5.00
1 2 3

The last line, 1 2 3, is the row of page numbers, which gets scraped along with the table. So, you will have to format the text yourself to fit your needs.
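As one way to do that formatting, here is a sketch that turns each page's string into a pandas DataFrame. The `parse_page` helper is my own illustration, and it assumes each page's text keeps one row per line, as in the sample above: the last 11 whitespace-separated fields of a data row are the maturity date plus the ten numeric columns, and everything before them is the security description.

import pandas as pd

COLUMNS = ["Security Description", "Maturity Date", "Bid Amt. (Cr.)",
           "Bid Yield", "Bid Price", "Offer Price", "Offer Yield",
           "Offer Amt. (Cr.)", "LTP", "LTY", "LTA", "TTA (Cr.)"]

def parse_page(text):
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header and the trailing "1 2 3" pager line: a data
        # row has at least 12 fields and a dd/mm/yyyy maturity date.
        if len(parts) < 12 or "/" not in parts[-11]:
            continue
        rows.append([" ".join(parts[:-11])] + parts[-11:])
    return pd.DataFrame(rows, columns=COLUMNS)

df = pd.concat([parse_page(t) for t in row_info.values()], ignore_index=True)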

Alper
  • How can I do this without Selenium, just by using BeautifulSoup, requests and pandas? In the other link I gave, Selenium is not used. Also, how do I install the webdriver? – Tech Bro Sep 02 '21 at 15:07
  • Unfortunately, I do not know how to navigate through links with `requests` library if the actual url does not lead to a different page. So, I can only offer a `Selenium` solution. You can download the Chrome webdriver [here](https://sites.google.com/a/chromium.org/chromedriver/downloads). Make sure to choose the one that corresponds to your Chrome browser version. – Alper Sep 02 '21 at 17:04
  • Yeah, your program works smoothly, really thanks for it. But my Chrome browser settings page and the webpage also remain open; how do I close the browser after the data is taken? Also, how do I format the data properly? I tried it with no results. I am also trying to increase the number of pages: the program handles 3 pages, and I have copy-pasted it for 15-20 pages. When a page is not found, how do I handle the error? – Tech Bro Sep 03 '21 at 05:28
  • To increase pages, you need to add another `elif` statement to the `next_page` function for each page with its respective XPath location, such as `elif page == 6:` then `next_page = driver.find_element_by_xpath("/html/body/form/table[5]/tbody/tr[1]/td/table/tbody/tr[27]/td/a[6]")`. For this to work properly, you need to manually copy each XPath with developer tools. [Step 1](https://i.imgur.com/UiX78aD.jpg), [Step 2](https://i.imgur.com/efR2ERl.jpg) Note that before you copy the XPath of page n, you need to be on page n-1. So, to copy link 6, make sure you are on page 5. – Alper Sep 03 '21 at 09:57
  • Also, you need to increase the `for` loop range to the number of pages you are scraping. It is currently `(1, 4)` for 3 pages; if you will be using it for a table with 10 pages, it should be `(1, 11)`. – Alper Sep 03 '21 at 10:00
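Building on those last two comments: instead of adding an `elif` and a hand-copied XPath for every page, the pager links can be looked up dynamically by their visible text, which also handles the "page not found" case. This is a sketch under the assumption that the pager renders each page number as a plain link (the 1 2 3 row in the sample output); `LAST_PAGE` is a placeholder you set yourself, and it uses the same old-style Selenium API as the answer.

from selenium.common.exceptions import NoSuchElementException

LAST_PAGE = 20  # set this to however many pages the table currently has

def next_page(page):
    # Find the pager link by its visible text instead of a
    # hand-copied XPath, so no per-page elif branches are needed.
    try:
        link = driver.find_element_by_link_text(str(page + 1))
    except NoSuchElementException:
        print("Last page reached, closing...")
        return
    webdriver.ActionChains(driver).move_to_element(link).click().perform()

for page in range(1, LAST_PAGE + 1):
    # ... scrape the current page as in the answer, then:
    next_page(page)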