
Why is it that when I add time.sleep(2) I get my desired output, but if I instead wait for a specific XPath/CSS element I get fewer results?

Output with time.sleep(2) (also desired):

Adelaide Utd
Tottenham
Dundee Fc
 ...

Count: 145 names

Output without time.sleep:

Adelaide Utd
Tottenham
Dundee Fc
 ...

Count: 119 names

I have added:

clickMe = wait(driver, 13).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#page-container > div:nth-child(4) > div > div.ubet-sports-section-page > div > div:nth-child(2) > div > div > div:nth-child(1) > div > div > div.page-title-new > h1")))

As this element is present on all pages.

The count is significantly lower with the explicit wait. How can I get around this issue?
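My guess is that the h1 title renders before the AJAX-loaded rows do, so the explicit wait returns too early. Would a custom condition that waits for the row count to stop changing between polls be the right direction? A sketch (the class name and the shortened selector are mine, untested against the site):

```python
# Sketch: an expected condition that is satisfied only once the number
# of matched elements is non-zero and has stopped growing between polls.
class CountStabilized:
    def __init__(self, selector):
        self.selector = selector
        self.last_count = -1

    def __call__(self, driver):
        count = len(driver.find_elements_by_css_selector(self.selector))
        if count and count == self.last_count:
            return True          # same non-zero count twice in a row
        self.last_count = count  # still changing; keep polling
        return False

# usage: wait(driver, 13).until(CountStabilized("div.lbl-offer > span"))
```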

Script:

import csv

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait


driver = webdriver.Chrome()
driver.maximize_window()

driver.get('https://ubet.com/sports/soccer')

# Wait for the sport drop-down before collecting its options.
wait(driver, 10).until(EC.element_to_be_clickable(
    (By.XPATH, '//select[./option="Soccer"]/option')))
options = driver.find_elements_by_xpath('//select[./option="Soccer"]/option')

for index in range(len(options)):
    try:
        try:
            zz = wait(driver, 10).until(EC.element_to_be_clickable(
                (By.XPATH, '(//select/optgroup/option)[%s]' % str(index + 1))))
            zz.click()
        except StaleElementReferenceException:
            pass

        # Wait for the page title, which is present on every page.
        wait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,
            "#page-container > div:nth-child(4) > div > div.ubet-sports-section-page > div > div:nth-child(2) > div > div > div:nth-child(1) > div > div > div.page-title-new > h1")))

        langs0 = driver.find_elements_by_css_selector(
            "div > div > div > div > div > div > div > div > div.row.collapse > div > div > div:nth-child(2) > div > div > div > div > div > div.row.small-collapse.medium-collapse > div:nth-child(1) > div > div > div > div.lbl-offer > span")
        langs0_text = []

        for lang in langs0:
            try:
                langs0_text.append(lang.text)
            except StaleElementReferenceException:
                pass

        directory = 'C:\\A.csv'
        with open(directory, 'a', newline='', encoding="utf-8") as outfile:
            writer = csv.writer(outfile)
            for row in zip(langs0_text):
                writer.writerow(row)
    except StaleElementReferenceException:
        pass

If you cannot access the page, you may need a VPN.

Update:

Perhaps that element loads before the others, so I changed the wait to target the scraped data instead (though not all pages have data to scrape).

Add:

try:
    clickMe = wait(driver, 13).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div > div > div > div > div > div > div > div > div.row.collapse > div > div > div:nth-child(2) > div > div > div > div > div > div.row.small-collapse.medium-collapse > div:nth-child(3) > div > div > div > div.lbl-offer > span")))
except TimeoutException:
    pass

The same issue is still present.
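One idea I have not tried yet: since pages without offers never render that selector, a small helper could poll for rows directly and treat a timeout as "no data" rather than an error. A sketch (the helper name and shortened selector are mine):

```python
import time

def rows_or_empty(driver, selector, timeout=13, poll=0.5):
    """Poll for matching rows; return [] if none appear before the timeout."""
    deadline = time.time() + timeout
    while True:
        rows = driver.find_elements_by_css_selector(selector)
        if rows or time.time() >= deadline:
            return rows
        time.sleep(poll)
```

This would replace the wait + find_elements_by_css_selector pair inside the loop.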

Manual steps:

1. Load https://ubet.com/sports/soccer
2. Click the drop-down option (//select/optgroup/option)
3. Wait for the page elements so they can be scraped
4. Scrape: div > div > div > div > div > div > div > div > div.row.collapse > div > div > div:nth-child(2) > div > div > div > div > div > div.row.small-collapse.medium-collapse > div:nth-child(1) > div > div > div > div.lbl-offer > span
5. Repeat the loop.
  • Which is your goal? If you only need to extract the data, cannot you extract it directly from the api links requested by the page (like https://ubet.com/api/sportsViewData/nexttoplay/false/3)? – Stefano P. Jan 02 '18 at 11:03
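Following up on the API suggestion in the comment above: if the endpoint returns JSON, the team names could be pulled straight from the payload instead of from the rendered DOM. The payload shape below is purely hypothetical (I have not inspected the real response); the recursive walk simply collects every value stored under an assumed "name" key:

```python
def collect_names(node, key="name"):
    """Recursively gather all string values stored under `key` anywhere in a payload."""
    names = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                names.append(v)
            else:
                names.extend(collect_names(v, key))
    elif isinstance(node, list):
        for item in node:
            names.extend(collect_names(item, key))
    return names

# e.g. payload = requests.get(
#     'https://ubet.com/api/sportsViewData/nexttoplay/false/3').json()
# names = collect_names(payload)
```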

2 Answers


The website is built on AngularJS, so your best bet would be to wait until Angular has finished processing all AJAX requests (I won't go into the underlying mechanics, but there are plenty of materials on that topic throughout the web). For this, I usually define a custom expected condition to check while waiting:

class NgReady:

    js = ('return (window.angular !== undefined) && '
          '(angular.element(document).injector() !== undefined) && '
          '(angular.element(document).injector().get("$http").pendingRequests.length === 0)')

    def __call__(self, driver):
        return driver.execute_script(self.js)

# NgReady does not have any internal state, so one instance 
# can be reused for waiting multiple times
ng_ready = NgReady()

Now use it to wait after zz.click():

zz.click()
wait(driver, 10).until(ng_ready)

Tests

  1. Your original code, unmodified (without sleeping or waiting with ng_ready):

    $ python so-47954604.py && wc -l out.csv && rm out.csv
    86 out.csv
    
  2. Using time.sleep(10) after zz.click():

    $ python so-47954604.py && wc -l out.csv && rm out.csv
    101 out.csv
    
  3. Same result when using wait(driver, 10).until(ng_ready) after zz.click():

    $ python so-47954604.py && wc -l out.csv && rm out.csv
    101 out.csv
    

Credits

NgReady is not my invention; I just ported it to Python from an expected condition implemented in Java that I found here, so all credits go to the author of that answer.

hoefling

@hoefling's idea is absolutely correct, but here is an addition to the "wait for Angular" part.

The logic inside NgReady only checks that angular is defined and that there are no pending requests left to process. Even though it works for this website, it is not a definitive answer to the question of whether Angular is ready to be interacted with.

If we look at what Protractor, the Angular end-to-end testing framework, does to "sync" with Angular, it uses the "Testability" API built into Angular.

There is also the pytractor package, which extends selenium webdriver instances with a WebDriverMixin that keeps the driver and Angular in sync automatically on every interaction.

You can either use pytractor directly (though it has been abandoned as a package), or we can try to apply the ideas implemented there in order to keep our webdriver synced with Angular. For that, let's create this waitForAngular.js script (we'll use only the Angular 1 and 2 support logic; we can always extend it with the relevant Protractor client-side scripts):

try { return (function (rootSelector, callback) {
  var el = document.querySelector(rootSelector);
  try {
    if (!window.angular) {
      throw new Error('angular could not be found on the window');
    }

    if (angular.getTestability) {
      angular.getTestability(el).whenStable(callback);
    } else {
      if (!angular.element(el).injector()) {
        throw new Error('root element (' + rootSelector + ') has no injector.' +
           ' this may mean it is not inside ng-app.');
      }
      angular.element(el).injector().get('$browser').
          notifyWhenNoOutstandingRequests(callback);
    }
  } catch (err) {
    callback(err.message);
  }
}).apply(this, arguments); }
catch(e) { throw (e instanceof Error) ? e : new Error(e); }

Then, let's inherit from webdriver.Chrome and patch the execute() method, so that every time there is an interaction, we additionally check whether Angular is ready before performing it:

import csv

from selenium import webdriver
from selenium.webdriver.remote.command import Command
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC


COMMANDS_NEEDING_WAIT = [
    Command.CLICK_ELEMENT,
    Command.SEND_KEYS_TO_ELEMENT,
    Command.GET_ELEMENT_TAG_NAME,
    Command.GET_ELEMENT_VALUE_OF_CSS_PROPERTY,
    Command.GET_ELEMENT_ATTRIBUTE,
    Command.GET_ELEMENT_TEXT,
    Command.GET_ELEMENT_SIZE,
    Command.GET_ELEMENT_LOCATION,
    Command.IS_ELEMENT_ENABLED,
    Command.IS_ELEMENT_SELECTED,
    Command.IS_ELEMENT_DISPLAYED,
    Command.SUBMIT_ELEMENT,
    Command.CLEAR_ELEMENT
]


class ChromeWithAngular(webdriver.Chrome):
    def __init__(self, root_element, *args, **kwargs):
        self.root_element = root_element

        with open("waitForAngular.js") as f:
            self.script = f.read()

        super(ChromeWithAngular, self).__init__(*args, **kwargs)

    def wait_for_angular(self):
        self.execute_async_script(self.script, self.root_element)

    def execute(self, driver_command, params=None):
        if driver_command in COMMANDS_NEEDING_WAIT:
            self.wait_for_angular()
        return super(ChromeWithAngular, self).execute(driver_command, params=params)


driver = ChromeWithAngular(root_element='body')

# the rest of the code as is with what you had 
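For what it's worth, the interception logic itself can be factored into a mixin and checked without a real browser; the stub driver below is purely illustrative and only demonstrates that the Angular sync runs before each intercepted command:

```python
SYNC_COMMANDS = {'clickElement', 'getElementText'}  # illustrative subset

class AngularSyncMixin:
    # Sync with Angular before any command that touches an element.
    def execute(self, driver_command, params=None):
        if driver_command in SYNC_COMMANDS:
            self.wait_for_angular()
        return super().execute(driver_command, params=params)

class StubDriver:
    # Stand-in for webdriver.Chrome that just records what happens.
    def __init__(self):
        self.log = []
    def wait_for_angular(self):
        self.log.append('sync')
    def execute(self, driver_command, params=None):
        self.log.append(driver_command)

class SyncedStub(AngularSyncMixin, StubDriver):
    pass
```

The same mixin could then be combined with webdriver.Chrome instead of the stub.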

Again, this is heavily inspired by the pytractor and protractor projects.

alecxe
  • Upvoting! I would be happy to see an ultimate solution that covers the angular case as pages built on angular have become very common these days. Would this work for both angularJS and angular 2/4 or only for the legacy one? – hoefling Dec 29 '17 at 00:41
  • @hoefling yeah, `pytractor` was a promising project and it's very unfortunate it got abandoned. I've edited the answer and included one option, but still having difficulties making it reliably working locally - hope to improve and post a more complete solution. Thanks. – alecxe Dec 29 '17 at 06:08