Scraping web data using PhantomJS and Selenium

Question

I am using PhantomJS with Selenium to scrape data from the link given in the snippet. When I extract the data with element.text under PhantomJS, some values in between come back blank, whereas with chromedriver I am able to scrape all of the data.

I can only use a headless browser, since I am running this on an AWS Linux server.

How can I scrape all the data with PhantomJS without missing any values? Any help is appreciated. Thank you in advance.

Below is the snippet attached

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import NoSuchElementException
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
     "(KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36")
driver = webdriver.PhantomJS(desired_capabilities = dcap,service_args=['--ignore-ssl-errors=true', '--load-images=false'])
driver.get("http://www.myntra.com/Dresses/Casual-Collection/Casual-Collection-by-Debenhams-Purple-Floral-Print-Maxi-Dress/348207/buy")
driver.implicitly_wait(5)
try:
    driver.find_element_by_class_name("size-buttons-show-size-chart").click()
    driver.implicitly_wait(10)
    div_s = driver.find_elements_by_class_name("size-chart-cell")
    # div_s = driver.find_elements_by_xpath("""//*[@id="mountRoot"]/div/div/div/div[3]/div/div[2]/div[1]/table/tbody/tr""")
    for s in div_s:
        # Under PhantomJS, .text comes back empty for some of these elements
        print(s.text)
except NoSuchElementException:
    print("NoSuchElementException")

Modified output:

Size XS S M L XL XXL 3XL
Brand Size UK10 UK12 UK14 UK16 UK18 UK20 UK22
Hips (INCHES) 36 38 40 42.5 45.25 48 50.75
31 41.75 # most elements in this row are missing / could not be scraped
Bust (INCHES) 34.25 36.25 38 40 43.75 46.5 49.25

Actual table is :

Maybe the waiting time is too short. Try `driver.implicitly_wait(30)`. – Guandan Chen, Dec 28 '16 at 14:11
I have already tried this, and it is not what I am asking about. – Dinu Duke, Dec 28 '16 at 14:13

score 1 · Accepted Answer · edited May 23 '17 at 10:30

Interesting problem. Reading the textContent attribute instead of .text actually works in this case:

for s in div_s:
    print(str(s.get_attribute("textContent")))
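
One thing to keep in mind, as a hedged aside: unlike .text, textContent comes back exactly as it sits in the DOM, so it can include the line breaks and indentation from the markup (and the text of hidden nodes). Below is a minimal sketch of tidying it up, assuming only that each element offers get_attribute() the way Selenium's WebElement does. The helper name clean_text_content is made up for illustration:

```python
def clean_text_content(elements):
    """Return the whitespace-normalized textContent of each element.

    `elements` is assumed to be a list of objects with a
    get_attribute(name) method, such as Selenium WebElements.
    """
    cleaned = []
    for element in elements:
        # get_attribute() may return None, so fall back to an empty string.
        raw = element.get_attribute("textContent") or ""
        # split() with no arguments splits on any run of whitespace
        # (spaces, tabs, newlines); joining with one space collapses it.
        cleaned.append(" ".join(raw.split()))
    return cleaned
```

With the list from the question, this would be called as clean_text_content(div_s), printing each returned string in turn.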

Differences between .text, textContent and other properties are perfectly described here:

Note that there is no point in calling implicitly_wait() multiple times. It does not act like time.sleep(): it does not pause for the given amount of time when called. Instead, it only sets the driver's "implicit wait" to the specified number of seconds:

An implicit wait is to tell WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements if they are not immediately available.

A better way to wait in this case would be to use Explicit Waits.

answered Dec 28 '16 at 14:17 by alecxe

@DineshSingh still not sure why `.text` was not able to retrieve the text of several cells. The table itself looks pretty normal - all the `td` elements have text nodes and they are not different from each other. Guess this is pretty much `PhantomJS` specific.. – alecxe Dec 28 '16 at 14:50
I also noticed that earlier. But with chromedriver it scrapes absolutely fine and no cell text is missing. I need to know the reason behind this. – Dinu Duke Dec 28 '16 at 14:56

score 0 · Answer 2 · edited May 23 '17 at 11:55

I think I found the answer/reason behind it.

Thanks for your reply @alecxe, I found my answer here:

The textContent property is "inherited" from the Node interface of the DOM Core specification. The text property is "inherited" from the HTML5 HTMLAnchorElement interface and is specified as "must return the same value as the textContent IDL attribute".

The two are probably retained to converge different browser behaviour, the text property for script elements is defined slightly differently.

Note that the DOM specification is a general specification for any kind of document (e.g. HTML, XML, SGML, etc.) whereas HTML5 is specifically for HTML that leverages and extends the DOM Core in many respects (some might say it's a "super set" of a few DOM specs plus HTML plus …).

Note that "inherited" does not mean "prototype inheritance", just the more general meaning of inherited

Again, thank you for this.

Difference between text and textContent properties