1

I'm fighting with unexpected behaviour in a Selenium-based Python 3 web scraper and want to understand what's going on:

I'm parsing sites with job offerings. After the initial search I get 1 to n pages of results. The number of pages is shown on the very first page as the text of the "m-pagination__meta" element, in German, e.g. "1 von 48" ("1 of 48"). I need this string for further processing. The element is in the main document; it is NOT part of an iframe.

Sample link of job website

The HTML:

<div class="m-pagination">
  <div class="m-pagination__inner m-pagination__inner--borderBottom">
    <button class="m-pagination__button m-pagination__button--disabled" data-page="" data-event-action="click: pagination-first">
      <svg viewBox="0 0 17 17" width="0" height="0" class="m-icon m-icon--large ">
        <g fill="none" stroke="currentColor" stroke-width=".7" stroke-linecap="round" stroke-linejoin="round">
          <path d="M9 13.2L4.2 8.5 9 3.8"></path>
          <path d="M12.8 13.2L8 8.5l4.7-4.7"></path>
        </g>
      </svg>
    </button>
    <button class="m-pagination__button m-pagination__button--previous m-pagination__button--disabled" data-page="false" data-event-action="click: pagination-previous">
      <svg viewBox="0 0 17 17" width="0" height="0" class="m-icon m-icon--large ">
        <path fill="none" stroke="currentColor" stroke-width=".8" stroke-linecap="round" stroke-linejoin="round" d="M10.9 3.8L6 8.6l4.7 4.6"></path>
      </svg>
    </button>
    <span class="m-pagination__meta" data-number="1"> 1 von 43 </span> 
    <button class="m-pagination__button m-pagination__button--next m-pagination__button--available" data-page="2" data-event-action="click: pagination-next">
      <svg viewBox="0 0 17 17" width="0" height="0" class="m-icon m-icon--large ">
        <path fill="none" stroke="currentColor" stroke-width=".7" stroke-linecap="round" stroke-linejoin="round" d="M6.1 3.8L11 8.6l-4.7 4.6"></path>
      </svg>
    </button>
  </div>
</div>

Now comes the weird part: when I debug the program and read the string directly via the .text property of the "m-pagination__meta" element, it returns an empty string.

Yet when I take the m-pagination__meta element object and inspect it in the debugger, scrolling down to its text property, the expected "1 von 48" string is there. After this inspection I CAN access .text with the expected result.

This behaviour seems not to be dependent on timing. I tried to wait for the presence of the required element with code like

wait = WebDriverWait(self.driver, 10)
wait.until(EC.text_to_be_present_in_element((By.CLASS_NAME,"m-pagination__meta"), "1 von 48"))
pagesTotal = int(self.driver.find_element_by_class_name("m-pagination__meta").text.split(" ")[2])

to no avail. (Of course, I realize it makes little sense to wait for a specific string when I don't know which one the page will show, but I didn't know how else to code it.)

I also tried "normal" waits using sleep, but nothing seems to work, only the mentioned inspection in the debugger, which is useless for production purposes.

I would really like to understand what is going on here.

undetected Selenium
SuperSpitter
  • Can you post a snippet of the HTML that contains the text you are looking for? – Greg Burghardt Jun 21 '19 at 12:06
  • You know what, it looks like the page loads as you scroll down. That's why Selenium isn't finding anything until you run the debugger. In the debugger, you scroll down to the element, which causes the rest of the page to load, and then Selenium can find the element. – Greg Burghardt Jun 21 '19 at 12:50
  • No, the element is there without scrolling. I just checked ... – Greg Burghardt Jun 21 '19 at 12:51

3 Answers

3

A vertical scroll bar divides the page into two sections. You first need to find the left-hand scrollable element and scroll it into view with location_once_scrolled_into_view. Once that is done, you can locate the element you are after.

Try the below code.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

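driver = webdriver.Chrome()
driver.get('https://www.karriere.at/jobs/programmierer/wien')
# Wait for the left-hand job listing, then scroll it into view (reading the property triggers the scroll)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='c-jobsSearch__listing']"))).location_once_scrolled_into_view
# Read the pagination text via the innerText property instead of .text
print(driver.find_element(By.XPATH, "//span[@class='m-pagination__meta']").get_attribute('innerText'))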
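The value read via innerText may carry the surrounding whitespace from the markup (the span in the question contains " 1 von 43 "), so splitting on a single space and taking index 2 can pick the wrong token. A minimal sketch of a more tolerant parser follows; parse_pagination is a hypothetical helper name, and the "<current> von <total>" layout is assumed from the HTML in the question.

```python
def parse_pagination(raw):
    """Return (current_page, total_pages) from text such as ' 1 von 43 '."""
    # split() without arguments splits on any whitespace run and
    # ignores leading/trailing blanks.
    parts = raw.split()
    if len(parts) != 3 or parts[1] != "von":
        raise ValueError(f"unexpected pagination text: {raw!r}")
    return int(parts[0]), int(parts[2])
```

With the driver from the snippet above, the result of get_attribute('innerText') could be passed straight to parse_pagination to obtain the total page count as an integer.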
KunduK
1

The problem might be due to the element appearing in the HTML source when the page loads, but JavaScript fills in the value behind the scenes.

You can wait until the element's text matches the expected pattern. Browsers only implement XPath 1.0, which has no matches() function for regular expressions, so check for the " von " separator with contains() instead:

xpath = '//*[contains(@class, "m-pagination__meta") and contains(normalize-space(.), " von ")]'
wait = WebDriverWait(self.driver, 30)
wait.until(EC.presence_of_element_located((By.XPATH, xpath)))

Note: Maybe increase the wait period to 30 seconds too, just to be safe.
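If the exact text is not known in advance, the pattern check can also be done on the Python side, since WebDriverWait.until() accepts any callable that takes the driver and returns a truthy value. The following is a minimal sketch under stated assumptions: text_matches and PAGINATION_RE are hypothetical names, not part of Selenium, and the text is read through the textContent property on the assumption that it is populated even when .text comes back empty.

```python
import re

# Assumed layout of the pagination label, e.g. "1 von 43".
PAGINATION_RE = re.compile(r"(\d+)\s+von\s+(\d+)")


def text_matches(locator, pattern=PAGINATION_RE):
    """Build a wait condition that returns the regex match once the
    element's textContent matches `pattern`, and False until then."""
    def condition(driver):
        element = driver.find_element(*locator)
        # get_attribute may return None, so fall back to an empty string.
        raw = element.get_attribute("textContent") or ""
        return pattern.search(raw) or False
    return condition


# Usage sketch (requires a live driver and the usual Selenium imports):
# match = WebDriverWait(driver, 20).until(
#     text_matches((By.CSS_SELECTOR, "span.m-pagination__meta")))
# pages_total = int(match.group(2))
```

Returning the match object rather than True lets the caller read the page numbers directly from the value that until() hands back.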

Greg Burghardt
0

You seem to be pretty close with WebDriverWait. Unfortunately, the element is located far down the DOM tree and is not within the viewport, hence an empty string is returned.


Solution

The solution is to wait until the element is visible in the HTML DOM using the expected condition visibility_of_element_located(), then scroll it into the viewport with scrollIntoView(), and finally extract the desired text. You can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.m-pagination__meta"))))
    print(driver.find_element(By.CSS_SELECTOR, "span.m-pagination__meta").get_attribute("innerHTML"))
    
  • Using XPATH:

    driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@class='m-pagination__meta']"))))
    print(driver.find_element(By.XPATH, "//span[@class='m-pagination__meta']").get_attribute("innerHTML"))
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
undetected Selenium
  • Many thanks for the snippets and the explanations! I marked the answer below as right though just because it was the only one where the code worked out of the box. It was hard to decide, a pity only one answer can be marked as right. – SuperSpitter Jun 24 '19 at 08:52