
I've written a script in Python, in combination with Selenium, to parse names from a webpage. The data on that site is not JavaScript-rendered, but the next-page links are: they look exactly like `javascript:nextPage();`, so they're useless with the requests library, which is why I've used Selenium to traverse the 25 pages. The only problem I'm facing is that although my scraper is able to reach the last page by clicking through all 25 pages, it only fetches the data from the first page. Moreover, the scraper keeps running even after it has clicked through the last page. Btw, the URL of the site never changes when I click the next-page button. How can I get all the names from the 25 pages? The CSS selector I've used in my scraper is flawless. Thanks in advance.

Here is what I've written:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")

while True:
    # Print every name found on the current page.
    for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
        print(name.text)

    try:
        # Find the javascript:nextPage(); link and run it to move to the next page.
        n_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='nextPage']")))
        driver.execute_script(n_link.get_attribute("href"))
    except:
        break

driver.quit()
SIM

2 Answers


You don't have to handle the "Next" button or change the page number at all - all of the entries are already in the page source. Try the code below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
# textContent is readable even for elements Selenium treats as hidden,
# unlike the .text property (see the comments below).
for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
    print(name.get_attribute('textContent'))

driver.quit()

If it's not mandatory for you to use Selenium, you can also try this solution:

import requests
from lxml import html

r = requests.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
source = html.fromstring(r.content)

for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)
Andersson
  • You beat me to it... I don't think the first bit will work because you can't `get_attribute()` on an invisible element using Selenium. I was going to suggest that he use JSE, e.g. `.execute_script("return arguments[0].innerText", name)` in your case (sketched after these comments). – JeffC Oct 16 '17 at 16:19
  • @JeffC, the `text` property doesn't allow getting the content of hidden elements. `get_attribute('textContent')` [works fine for this purpose](https://stackoverflow.com/questions/43429788/python-selenium-finds-h1-element-but-returns-empty-text-string/43430097#43430097) – Andersson Oct 16 '17 at 16:20
  • Thanks, sir Andersson, for such a robust solution. Someday I'll come up with a difficult problem to solve, because so far you have been invincible. Thanks again. – SIM Oct 16 '17 at 16:42
  • @Andersson I wasn't sure but I'm glad that you've tested it and know it works. – JeffC Oct 16 '17 at 17:14
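For reference, a minimal sketch of the JSE alternative JeffC describes in the comments above, reusing the selector from the answer; it assumes the same `driver`/`wait` setup as that snippet and is an illustration, not a tested replacement for `get_attribute('textContent')`:

# Sketch of JeffC's execute_script() suggestion: ask the browser for each
# element's innerText instead of reading Selenium's .text property.
# Assumes the same imports, driver and wait objects as the answer above.
names = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg")))
for name in names:
    print(driver.execute_script("return arguments[0].innerText", name))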

It appears this can actually be done more simply than the current approach. After the `driver.get` call, you can simply use the `page_source` property to get the HTML behind it; from there you can extract the data from all 25 pages at once. To see how it's structured, just right-click and choose "View source" in Chrome.

html_string = driver.page_source
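
To flesh that idea out, here's a minimal sketch that combines `page_source` with the lxml parsing from the first answer (the xpath is borrowed from there, not part of this answer):

from lxml import html

# Grab the HTML Selenium has already rendered, then parse it with lxml,
# reusing the xpath from the first answer to print every name at once.
html_string = driver.page_source
source = html.fromstring(html_string)
for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)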
SuperStew