I would like to scrape the content of Hong Kong's legislation. However, I have trouble accessing the content that is not visible unless I scroll down the page.

The website I'm accessing: https://www.elegislation.gov.hk/hk/cap211

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import ElementNotVisibleException
from selenium.webdriver.common.action_chains import ActionChains

def init_driver(profile):
    driver = webdriver.Firefox(profile)
    driver.wait = WebDriverWait(driver, 5)
    return driver

def convert2text2(webElement):
    if webElement != []:
        webElements = []
        for element in webElement:
            e = element.text.encode('utf8')
            webElements.append(e)
    else:
        webElements = ['NA']
    return webElements

profile = webdriver.FirefoxProfile()
driver = init_driver(profile)
url = 'https://www.elegislation.gov.hk/hk/cap211'
driver.get(url)
driver.wait = WebDriverWait(driver, 5)

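# find_elements(By.XPATH, ...) works in Selenium 3 and 4; find_elements_by_xpath was removed in Selenium 4
content = driver.find_elements(By.XPATH, "//div[@class='hklm_content' or @class='hklm_leadIn' or @class='hklm_continued']")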
content = convert2text2(content)

I understand that the following code, taken from "How can I scroll a web page using selenium webdriver in python?", is used for scrolling the browser window:

import time

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

But I couldn't figure out how to target the scroll bar of the content window and scroll to the bottom of it.

Seamus Lam

1 Answer

You just put last_height into the JavaScript code, like so:

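import time

SCROLL_PAUSE_TIME = 0.5

# Get the initial scroll height (same setup as in the question)
last_height = driver.execute_script("return document.body.scrollHeight;")

while True: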
    # Scroll down to 'last_height'
    driver.execute_script("window.scrollTo(0, {});".format(last_height))

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight;")
    if new_height == last_height:
        break
    last_height = new_height
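
The loop above scrolls the main window. If the text you want sits inside its own scrollable element, the same idea can be applied to that element instead. This is only a rough sketch under that assumption: scroll_element_to_bottom, pause and max_rounds are names I made up, and I have not run it against this site. It relies on Selenium passing extra execute_script arguments to the script as arguments[0].

```python
import time


def scroll_element_to_bottom(driver, element, pause=0.5, max_rounds=50):
    """Scroll `element` down until its scrollHeight stops growing.

    Returns the last scrollHeight reported by the browser.
    """
    last_height = driver.execute_script(
        "return arguments[0].scrollHeight;", element
    )
    for _ in range(max_rounds):
        # Jump to the current bottom of the element, not of the window.
        driver.execute_script(
            "arguments[0].scrollTop = arguments[0].scrollHeight;", element
        )
        # Give lazily loaded content a moment to arrive.
        time.sleep(pause)
        new_height = driver.execute_script(
            "return arguments[0].scrollHeight;", element
        )
        if new_height == last_height:
            break
        last_height = new_height
    return last_height
```

You would find the container first with driver.find_element(By.XPATH, ...), using whatever locator matches the scrollable pane, and pass the result in as element. If the pane turns out to live inside a frame, you need driver.switch_to.frame(...) before looking it up. The max_rounds cap is just there so the loop cannot run forever on a page that keeps growing.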

Another way of going about this would be to pull the data out without Selenium. If you look at the requests the page makes (Chrome inspector, Network tab), you'll see that each new element is loaded into the page as a small chunk of XML.

The url for the starting point is 'https://www.elegislation.gov.hk/xml?skipHSC=true&LANGUAGE=E&BILINGUAL=&LEG_PROV_MASTER_ID=181740&QUERY=.&INDEX_CS=N&PUBLISHED=true'

The LEG_PROV_MASTER_ID parameter increases by 1 for each chunk that the site loads.

You could grab it all using requests like so:

import requests

url = 'https://www.elegislation.gov.hk/xml?skipHSC=true&LANGUAGE=E&BILINGUAL=&LEG_PROV_MASTER_ID={}&QUERY=.&INDEX_CS=N&PUBLISHED=true'
starting_count = 181740
# Placeholder only: this must be an integer, and you need to figure out
# when you have got all you need. As written, only the first chunk is fetched.
stop_count = starting_count

for count in range(starting_count, stop_count + 1):
    response = requests.get(url.format(count))
    response.raise_for_status()  # stop early on an HTTP error
    # parse the XML and grab the parts you need...
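
The loop above leaves the parsing step as a comment. Here is a minimal sketch of what it could look like with the standard library's xml.etree.ElementTree. I have not checked the actual schema the site returns: the tag names content, leadIn and continued are only a guess based on the hklm_* class names in the question, the sample XML is made up, and extract_text and local_name are my own helper names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_data, wanted=("content", "leadIn", "continued")):
    """Return the text of every element whose local tag name is in `wanted`."""
    root = ET.fromstring(xml_data)
    texts = []
    for element in root.iter():
        if local_name(element.tag) in wanted:
            # itertext() also picks up text nested in child elements;
            # split/join collapses runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                texts.append(text)
    return texts


# Made-up sample, NOT real output from the site.
sample = """<section xmlns="http://example.com/ns">
  <leadIn>In this Ordinance,</leadIn>
  <content>a <b>term</b> means something.</content>
  <note>ignored</note>
</section>"""

print(extract_text(sample))
```

Inside the loop you would call extract_text(response.content) and adjust wanted once you have looked at a real chunk in the Network tab.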
jlaur
  • I suspect your code will still error, though (depending on Python version). In py3 the .text is already a unicode string, as strings are unicode by default. – jlaur Jul 03 '17 at 12:59
  • Your first solution did not work for me. But thanks for suggesting solution 2. After some modification to your suggested solution, I was able to access the content. – Seamus Lam Jul 04 '17 at 06:24
  • Sorry, I missed a ";" at the end of the JavaScript. I didn't run the code. It works now, but not for this particular site, as the content you're after is inside a frame. So if you want help entering the frame and scrolling inside it, please check SO for questions on that; if there are none, post a new question. – jlaur Jul 04 '17 at 07:56