
I am trying to scrape the 'ASX code' for announcements made by companies on the Australian Securities Exchange from the following website: http://www.asx.com.au/asx/statistics/todayAnns.do

So far I have tried using BeautifulSoup with the following code:

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
parser = BeautifulSoup(response.content, 'html.parser')
print(parser)

However, when I print this, the output does not match what I see when I manually view the page source in my browser. From some googling and searching on Stack Overflow, I believe this is because JavaScript runs on the page and generates the HTML dynamically, so the initial response doesn't contain it.

However, I am unsure how to get around this. Any help would be greatly appreciated.

Thanks in advance.

James Ward
    You tagged Selenium, so did you try it? – OneCricketeer Nov 09 '17 at 00:57
  • I am completely unsure where to start with Selenium. I have found an example where it clicks buttons and provides the source code here: https://stackoverflow.com/questions/8960288/get-page-generated-with-javascript-in-python but I don't need to click buttons - I just need the source code. I will keep searching however. Thanks for the links @cricket_007. – James Ward Nov 09 '17 at 01:16
  • The website is generated dynamically, other than using and finding their API to request the data you need or a browser emulator I can't think of a solution. – innicoder Nov 09 '17 at 01:28
  • @ElvirMuslic is a browser emulator a viable option? Will selenium work? I have written a snippet of selenium code: `from selenium import webdriver from selenium.common.exceptions import TimeoutException from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0 from selenium.webdriver.support import expected_conditions as EC driver = webdriver.Firefox() driver.get('http://www.asx.com.au/asx/statistics/todayAnns.do') tickers = driver.find_elements_by_class_name("row") print(tickers)`. However I am pretty sure Selenium only works on Python 2 and I only have Python 3 – James Ward Nov 09 '17 at 01:31
  • Definitely supports python 3. https://pypi.python.org/pypi/selenium – OneCricketeer Nov 09 '17 at 02:17

1 Answer


Try this. All you need to do is let the scraper wait a few moments for the page to load, since, as you already noticed, the content is generated dynamically by JavaScript. Upon execution, you will get the left-side column of the table on that page (the ASX codes).

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
time.sleep(8)  # crude wait for the JavaScript-rendered content to appear

# parse the fully rendered page source, not the initial HTTP response
soup = BeautifulSoup(driver.page_source, "lxml")
for item in soup.select('.row'):
    print(item.text)
driver.quit()

Partial results:

RLC
RNE
PFM
PDF
HXG
NCZ
NCZ

Btw, I wrote and ran this code with Python 3.5, so there is no issue binding Selenium to the latest versions of Python.
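For reference, a more robust variant replaces the fixed `time.sleep(8)` with Selenium's explicit waits, which poll until the target elements actually appear. This is a sketch, not part of the original answer; the `extract_codes` helper and its 3-letter filter are illustrative assumptions about what an ASX code looks like:

```python
def extract_codes(rows):
    """Keep only entries that look like 3-letter ASX codes (illustrative filter)."""
    codes = []
    for text in rows:
        text = text.strip()
        if len(text) == 3 and text.isalpha() and text.isupper():
            codes.append(text)
    return codes


if __name__ == "__main__":
    # Selenium imports kept inside the guard so the helper above
    # can be used and tested without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
        # wait up to 20 seconds for at least one .row element,
        # instead of sleeping for a fixed interval
        WebDriverWait(driver, 20).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.row'))
        )
        rows = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.row')]
        print(extract_codes(rows))
    finally:
        driver.quit()
```

The explicit wait returns as soon as the elements exist, so on a fast connection this finishes well before the 8-second sleep would.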

SIM
  • Thank you so much. This is beautiful. I actually wrote code very similar to this in the end, except I used re instead of bs4. I really appreciate it. Do you have any idea how I would speed up the process of Selenium if I wanted to do this on a mass scale? Thanks again! – James Ward Nov 09 '17 at 10:07
  • There's a "wait until" function. For instance, you can find the element by XPath or otherwise: `from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC ff = webdriver.Firefox() ff.get("http://somedomain/url_that_delays_loading") try: element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "myDynamicElement"))) finally: ff.quit()` – innicoder Nov 09 '17 at 11:21
  • @ElvirMuslic Thank you. That is extremely helpful. – James Ward Nov 09 '17 at 11:56
  • @JamesWard I'm glad you found that helpful. Here's the official documentation on explicit waits: http://selenium-python.readthedocs.io/waits.html#explicit-waits You can also use implicit waits (which behave like a blanket sleep). It has all sorts of examples, written so you can understand the library and use it right away. – innicoder Nov 09 '17 at 12:55
  • +1 Many pages take time to load their content, which is why scrapers don't find it in the initial page source. A simple and elegant solution, thanks! – Dev Jan 05 '19 at 12:49
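On the mass-scale question raised in the comments above: a common way to speed Selenium up is to run the browser headless and reuse one driver session across many pages, since browser startup is usually the slow part. This is a sketch under those assumptions; the `headless_flags` helper and its exact flag set are illustrative, not from the original thread:

```python
def headless_flags():
    """Chrome flags commonly used for faster, display-free scraping runs.

    The exact flag set is an assumption; tune it for your environment.
    """
    return ["--headless", "--disable-gpu", "--window-size=1920,1080"]


if __name__ == "__main__":
    # Selenium import kept inside the guard so headless_flags()
    # stays usable without a browser installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in headless_flags():
        options.add_argument(flag)

    # Reuse one browser session for every URL you visit instead of
    # constructing a fresh driver per page.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
        print(len(driver.page_source))
    finally:
        driver.quit()
```

Combined with the explicit waits linked in the comment above, this avoids both the per-page fixed sleep and the overhead of rendering to a display.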