Problem
I'm scraping a dynamic page - one that appears to load its results from a database - but I'm only getting the placeholders for the text elements rather than the text itself. The page I'm loading is:
https://www.bricklink.com/v2/search.page?q=8084#T=S
Expected / Actual
Expected:
<table>
<tr>
<td class="pspItemClick">
<a class="pspItemNameLink" href="www.some-url.com">The Name</a>
<br/>
<span class="pspItemCateAndNo">
<span class="blcatList">Catalog Num</span> : 1111
</span>
</td>
</tr>
</table>
Actual:
<table>
<tr>
<td class="pspItemClick">
<a class="pspItemNameLink" href="[%catalogUrl%]">[%strItemName%]</a>
<br/>
<span class="pspItemCateAndNo">
<span class="blcatList">[%strCategory%]</span> : [%strItemNo%]
</span>
</td>
</tr>
</table>
Attempted Solutions
- I first just tried loading the site using the `requests` library which, of course, didn't work since it's not a static page.
import requests
from bs4 import BeautifulSoup

def load_page(url: str) -> BeautifulSoup:
    # Note: the Access-Control-* entries are CORS response headers and have no
    # effect when sent with a request; only the User-Agent is meaningful here.
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return BeautifulSoup(req.content, 'html.parser')
- I then tried Selenium's `webdriver` to load the dynamic content:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver

def html_source_from_webdriver(url: str, wait: int = 0) -> BeautifulSoup:
    # selenium_chrome_service, options, and ROOT_URL are defined elsewhere.
    # Selenium 4 renamed the chrome_options= keyword to options=.
    browser = webdriver.Chrome(service=selenium_chrome_service, options=options)
    browser.implicitly_wait(wait)
    browser.get(urljoin(ROOT_URL, url))
    page_source = browser.page_source
    browser.quit()
    return BeautifulSoup(page_source, features="html.parser")
Both attempts yield the same result. I haven't used the `implicitly_wait` feature much, so I experimented with different values (0-15 seconds), none of which worked. I've also tried `browser.set_script_timeout(<timeout>)`, which did not work either.
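In case it's relevant: I'm aware of Selenium's explicit waits as an alternative to `implicitly_wait`. A rough sketch of what that would look like here, assuming the `[%...%]` tokens shown in the "Actual" output above only appear while the template is unrendered (the function name and the bare `webdriver.Chrome()` call are placeholders, not my actual setup):

```python
def html_after_explicit_wait(url: str, timeout: int = 15):
    # Imports are deferred so the sketch reads as a self-contained unit;
    # in real code they'd sit at module level.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()  # plus whatever service/options you need
    try:
        browser.get(url)
        # Block (up to `timeout` seconds) until no [% ... %] template tokens
        # remain in the page source; raises TimeoutException otherwise.
        WebDriverWait(browser, timeout).until(
            lambda drv: "[%" not in drv.page_source
        )
        return BeautifulSoup(browser.page_source, "html.parser")
    finally:
        browser.quit()
```

The custom `lambda` condition is used because the placeholder elements are already present in the DOM, so the built-in `presence_of_element_located` condition would return immediately.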
Any thoughts on where to go from here would be greatly appreciated.
Update
I appreciate those of you providing suggestions. I've also tried the following, with no luck:
- using `time.sleep()`, added after the `browser.get(...)` call
- using `browser.set_page_load_timeout()` - didn't expect this one to work, but tried anyway
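For anyone experimenting with this: a quick way to tell a rendered page apart from the raw template is to check for the `[%...%]` tokens shown in the "Actual" output above. This small helper is my own, not part of any library:

```python
import re

def looks_unrendered(html: str) -> bool:
    """Return True if the HTML still contains [%...%] template tokens,
    i.e. the client-side rendering has not run yet."""
    return re.search(r"\[%\w+%\]", html) is not None
```

For example, `looks_unrendered('<a>[%strItemName%]</a>')` is True, while `looks_unrendered('<a>The Name</a>')` is False.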