
I used Selenium and PhantomJS hoping to get data from a website that uses JavaScript to build the DOM.

The simple code below works, but not reliably. I mean that most of the time it returns an empty page where the JavaScript was never executed; only occasionally does it get the correct info I want.

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS
driver.get(url)

print(driver.page_source, file=open('output.html','w'))

soup = BeautifulSoup(driver.page_source,"html5lib")
print(soup.select('#MetaDescription'))

It has a high probability of returning an empty string:

[<meta content="" id="MetaDescription" name="description"/>]

Is the website server blocking web crawlers? What can I do to fix my code?

What's more, all the info I need can be found in the <head>'s <meta> tag. (As shown above, the data has the id MetaDescription.)

Or is there any simpler way to just get the data in the <head> tag?


1 Answer


First of all, driver = webdriver.PhantomJS is not the correct way to initialize a Selenium webdriver in Python; replace it with:

driver = webdriver.PhantomJS()

The symptoms you are describing are typical of a timing issue. Add an explicit wait for the desired element(s) to be present before trying to get the page source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS()
driver.get(url)

# waiting for presence of an element
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#MetaDescription")))

print(driver.page_source, file=open('output.html','w'))

driver.quit()  # quit() also terminates the PhantomJS process

# further HTML parsing here
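As for reading the data straight from the <head> tag: once the rendered page source is available, the meta description can be extracted with BeautifulSoup just as in the question. A minimal sketch against a static snippet mirroring the tag shown above (using the built-in html.parser instead of html5lib; the description text is an illustrative placeholder):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the explicit wait has succeeded
html = '''<html><head>
<meta id="MetaDescription" name="description" content="Example product description"/>
</head><body></body></html>'''

soup = BeautifulSoup(html, 'html.parser')
meta = soup.select_one('#MetaDescription')
print(meta['content'])  # -> Example product description
```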

You may also need to ignore SSL errors and set the SSL protocol to any. In some cases, pretending not to be PhantomJS helps as well.
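For reference, a sketch of how those options can be passed when constructing the driver. The user-agent string is an illustrative assumption, and the construction line is commented out because it needs a local PhantomJS binary:

```python
# PhantomJS service arguments: ignore SSL errors and accept any SSL protocol
service_args = ['--ignore-ssl-errors=true', '--ssl-protocol=any']

# Override the default PhantomJS user agent so the site sees a regular
# desktop browser (the user-agent string below is only an example)
caps = {
    'phantomjs.page.settings.userAgent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/52.0 Safari/537.36'
}

# driver = webdriver.PhantomJS(service_args=service_args,
#                              desired_capabilities=caps)
```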

  • Thanks! I didn't consider the timing issues. However, the page has tags with empty content before loading; it uses JavaScript to fill in the blanks. So I used `Implicit Waits` to replace the waiting part. I also tried the two links at the same time, and it only succeeded a few times. – WenT Aug 05 '16 at 20:01
  • @WenT Okay, in that case it is just about choosing the right condition to wait for. For example, wait for the presence of the product title: `#NickContainer`. Or maybe wait for the product `img` element to be present. – alecxe Aug 05 '16 at 20:04
  • I used `driver.implicitly_wait(30)` but in vain. However, I tried `time.sleep(5)` and it works!!! So it's actually caused by timing issues. But the `implicit waits` didn't work for me; am I using them the wrong way? Whatever, thanks for giving directions!! – WenT Aug 05 '16 at 20:22
  • @WenT `time.sleep()` is not something you should use; it is highly unreliable and most of the time you end up waiting longer than you should. Please continue picking the right condition for `wait.until()`, thanks. – alecxe Aug 05 '16 at 20:24