
I am trying to scrape the following Javascript frontend website to practise my Javascript scraping skills: https://www.oplaadpalen.nl/laadpaal/112618

I am trying to find two different elements by their XPath. The first one is the title, which it does find. The second one is the actual text itself, which it somehow fails to find. That is strange, since I copied both XPaths straight from the Chrome browser.

from selenium import webdriver

link = 'https://www.oplaadpalen.nl/laadpaal/112618'
driver = webdriver.PhantomJS()
driver.get(link)

#It could find the right element
xpath_attribute_title = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[' + str(3) + ']/label'
next_page_elem_title = driver.find_element_by_xpath(xpath_attribute_title)
print(next_page_elem_title.text)

#It fails to find the right element
xpath_attribute_value = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[' + str(3) + ']/text()'
next_page_elem_value = driver.find_element_by_xpath(xpath_attribute_value)
print(next_page_elem_value.text)

I have tried a couple of things, such as changing "text()" into "text" or "(text)", but none of them seem to work.

I have two questions:

  • Why doesn't it find the correct element?
  • What can we do to make it find the correct element?
Chiel
  • you could replace the statement by xpath_attribute_value = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[3]/text()' – Chiel Feb 14 '18 at 13:52
  • str() was just to cast 3 as a number to a string. – Chiel Feb 14 '18 at 13:53

5 Answers


Selenium's find_element_by_xpath() method returns the first element node matching the given XPath query, if any. However, the text() step at the end of your XPath selects a text node, not the element node that contains it, so there is no element for Selenium to return.

To extract the text using Selenium's finder methods, you'll need to find the containing element, then extract the text from the returned object.
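As a rough sketch of that idea (not tested against the live page): locate the containing div and its label as elements, then strip the label's text off the front of the div's text. The helper names `value_after_label` and `read_title_and_value` are made up for illustration, and the code assumes the same Selenium 3 style `find_element_by_xpath` API and the same XPath as in the question.

```python
def value_after_label(container_text, label_text):
    """Return the part of the container's text that follows the label's text."""
    if container_text.startswith(label_text):
        return container_text[len(label_text):].strip()
    # Label text not found at the start: fall back to the whole text.
    return container_text.strip()


def read_title_and_value(driver, index=3):
    """Hypothetical helper: fetch the div as an element, never its text node."""
    base = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[%d]' % index
    container = driver.find_element_by_xpath(base)        # element node
    label = container.find_element_by_xpath('./label')    # nested element node
    return label.text, value_after_label(container.text, label.text)


# The string handling can be tried without a browser:
print(value_after_label("Paalcode:\nNewMotion 04001157", "Paalcode:"))
```

With a live driver you would call `read_title_and_value(driver, 3)`; the exact whitespace between label and value on the real page is an assumption here, which is why the helper strips it.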

Ian Lesperance
  • Too bad I can only accept one answer. I think you gave a nice explanation to the why question. – Chiel Feb 15 '18 at 08:31

Keeping your own logic intact, you can extract the labels and the associated values as follows:

for x in range(3, 8):
    label = driver.find_element_by_xpath("//div[@class='labels']//following::div[%s]/label" %x).get_attribute("innerHTML")
    value = driver.find_element_by_xpath("//div[@class='labels']//following::div[%s]" %x).get_attribute("innerHTML").split(">")[2]
    print("Label is %s and value is %s" % (label, value))

Console Output :

Label is Paalcode: and value is NewMotion 04001157
Label is Adres: and value is Deventerstraat 130
Label is pc/plaats: and value is 7321cd Apeldoorn
undetected Selenium

I would suggest a slightly different approach. I would grab the entire text and then split one time on :. That will get you the title and the value. The code below will get Paalcode through openingstijden labels.

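# find_elements (plural) returns a list, which can be indexed
rows = driver.find_elements_by_css_selector("div.leftblock > div.labels > div")
for x in range(2, 8):
    s = rows[x].text
    t = s.split(":", 1)  # split on the first colon only
    print(t[0]) # title
    print(t[1]) # value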

You don't want to split more than once because the Status value itself contains more colons.
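To see the effect of splitting only once, here is a small standalone sketch. The sample string is made up for illustration (it is not copied from the page), and `split_title_value` is a hypothetical helper name.

```python
def split_title_value(text):
    """Split on the first colon only; anything after it is the value."""
    parts = text.split(":", 1)
    title = parts[0].strip()
    # If there is no colon at all, return an empty value instead of failing.
    value = parts[1].strip() if len(parts) > 1 else ""
    return title, value


sample = "Status: free: 1, occupied: 0"  # made-up text with extra colons

print(sample.split(":"))          # splits at every colon, value is cut apart
print(split_title_value(sample))  # title and the complete value
```

The `len(parts) > 1` check is an extra guard for rows without any colon, where `t[1]` would otherwise raise an IndexError.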

JeffC

Going with @JeffC's approach, if you want to first select all those elements using xpath instead of css selector, you may use this code:

xpath_title_value = "//div[@class='labels']//div[label[contains(text(),':')] and not(div) and not(contains(@class,'toolbox'))]"
title_and_value_elements = driver.find_elements_by_xpath(xpath_title_value)

Notice the plural elements in the find_elements_by_xpath method. The xpath above selects div elements that are descendants of a div element that has a class attribute of "labels". The nested label of each selected div must contain a colon. Furthermore, the div itself may not have a class of "toolbox" (something that certain other divs on the page have), nor may it contain any additional nested divs.

Following which, you can extract the text within the individual div elements (which also contain the text from the nested label elements) and then split them using ":\n" which separates the title and value in the raw text string.

for element in title_and_value_elements:
    element = element.text
    title,value = element.split(":\n")
    print(title)
    print(value,"\n")
Farhaan S.
  • I ended up using this approach for selecting the xPath: `xpath_attribute_title_and_value = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[3]'`, but your approach worked as well. – Chiel Feb 15 '18 at 08:45

Since you want to practice your JS skills, you can also do this in JS. The divs actually contain more data, as you can see if you paste this into the browser console:

labels = document.querySelectorAll(".labels");
divs = labels[0].querySelectorAll("div");
for (div of divs) console.log(div.firstChild, div.textContent); 

You can push the results to an array, keeping only the divs whose first child is a label, and return the resulting array into a Python variable:

labels_value_pair = driver.execute_script('''
var scrap = [];
var labels = document.querySelectorAll(".labels");
var divs = labels[0].querySelectorAll("div");
// keep only divs whose first child is a <label>; skip empty divs
for (var div of divs) if (div.firstChild && div.firstChild.tagName==="LABEL") scrap.push(div.firstChild.textContent, div.textContent);
return scrap;
''')
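The script returns one flat list in the order label, full div text, label, full div text, and so on. A possible way to turn that into a dictionary on the Python side is sketched below. `pair_up` is a hypothetical helper, and the sample list only imitates what the script might return (the exact whitespace in `textContent` on the real page is an assumption).

```python
def pair_up(flat):
    """Turn [label, text, label, text, ...] into {label: value}."""
    pairs = {}
    for label, text in zip(flat[0::2], flat[1::2]):
        # textContent of the div also includes the label, so cut it off.
        value = text[len(label):] if text.startswith(label) else text
        pairs[label.strip().rstrip(":")] = value.strip()
    return pairs


# Imitation of the returned list, for trying the helper without a browser.
sample = [
    "Paalcode:", "Paalcode:NewMotion 04001157",
    "Adres:", "Adres:Deventerstraat 130",
]
print(pair_up(sample))
```

With a live driver you would pass the list returned by `execute_script` to `pair_up` instead of `sample`.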
Eduard Florinescu