
I used Selenium and PhantomJS hoping to get data from a website that uses JavaScript to build the DOM.

The simple code below works, but not reliably. I mean that most of the time it returns an empty page where the JavaScript was never executed; only occasionally does it get the correct info I want.

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS
driver.get(url)

print(driver.page_source, file=open('output.html','w'))

soup = BeautifulSoup(driver.page_source,"html5lib")
print(soup.select('#MetaDescription'))

It has a high probability of returning an empty string:

[<meta content="" id="MetaDescription" name="description"/>]

Is the website server blocking web crawlers? What can I do to fix my code?

What's more, all the info I need can be found in the <head>'s <meta> tag. (As shown above, the data has the id MetaDescription.)

Or is there any simpler way to just get the data in the <head> tag?


1 Answer


First of all, driver = webdriver.PhantomJS is not the correct way to initialize a Selenium webdriver in Python; replace it with:

driver = webdriver.PhantomJS()

The symptoms you are describing are typical of a timing issue. Add an explicit wait for the desired element(s) to be present before trying to get the page source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS()
driver.get(url)

# waiting for presence of an element
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#MetaDescription")))

print(driver.page_source, file=open('output.html','w'))

driver.quit()  # quit() also terminates the PhantomJS process

# further HTML parsing here
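As for reading the data straight from the <head> tag: once the rendered page source is available, the meta description can be extracted with BeautifulSoup just as in the question. A minimal sketch against a static snippet mirroring the tag shown above (using the built-in html.parser instead of html5lib; the description text is an illustrative placeholder):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the explicit wait has succeeded
html = '''<html><head>
<meta id="MetaDescription" name="description" content="Example product description"/>
</head><body></body></html>'''

soup = BeautifulSoup(html, 'html.parser')
meta = soup.select_one('#MetaDescription')
print(meta['content'])  # -> Example product description
```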

You may also need to ignore SSL errors and set the SSL protocol to any. In some cases, pretending not to be PhantomJS helps as well.
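For reference, a sketch of how those options can be passed when constructing the driver. The user-agent string is an illustrative assumption, and the construction line is commented out because it needs a local PhantomJS binary:

```python
# PhantomJS service arguments: ignore SSL errors and accept any SSL protocol
service_args = ['--ignore-ssl-errors=true', '--ssl-protocol=any']

# Override the default PhantomJS user agent so the site sees a regular
# desktop browser (the user-agent string below is only an example)
caps = {
    'phantomjs.page.settings.userAgent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/52.0 Safari/537.36'
}

# driver = webdriver.PhantomJS(service_args=service_args,
#                              desired_capabilities=caps)
```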

  • Thanks! I didn't consider the timing issues. However, the page has tags with empty content before loading; it uses JavaScript to fill in the blanks. So I used `Implicit Waits` to replace the waiting part. I also tried the two links at the same time, and it only succeeded a few times. – WenT Aug 05 '16 at 20:01
  • @WenT Okay, in that case it is just about choosing the right condition to wait for. For example, wait for the presence of the product title: `#NickContainer`. Or maybe wait for the product `img` element to be present. – alecxe Aug 05 '16 at 20:04
  • I used `driver.implicitly_wait(30)` but in vain. However, I tried `time.sleep(5)` and it works!!! So it's actually caused by timing issues. But the `implicit waits` didn't work for me; am I using them the wrong way? Whatever, thanks for giving directions!! – WenT Aug 05 '16 at 20:22
  • @WenT `time.sleep()` is not something you should use; it is highly unreliable and most of the time you end up waiting longer than you should. Please continue picking the right condition for `wait.until()`, thanks. – alecxe Aug 05 '16 at 20:24