I'm working on some code that uses the Selenium Firefox WebDriver. Most things seem to work, but when I switch the browser to PhantomJS, it starts to behave differently.

The page I'm processing has to be scrolled slowly to load more and more results, and that's probably the problem.

Here is the code, which works with the Firefox webdriver but not with PhantomJS:

def get_url(destination,start_date,end_date): #the date is like %Y-%m-%d 
    return "https://www.pelikan.sk/sk/flights/listdfc=%s&dtc=C%s&rfc=C%s&rtc=%s&dd=%s&rd=%s&px=1000&ns=0&prc=&rng=0&rbd=0&ct=0&view=list" % ('CVIE%20BUD%20BTS',destination, destination,'CVIE%20BUD%20BTS', start_date, end_date)



def load_whole_page(self,destination,start_date,end_date):
        deb()

        url = get_url(destination,start_date,end_date)

        self.driver.maximize_window()
        self.driver.get(url)

        wait = WebDriverWait(self.driver, 60)
        wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
        wait.until(EC.invisibility_of_element_located((By.XPATH,
                                                       u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
        i=0
        old_driver_html = ''
        end = False
        while end==False:
            i+=1

            results = self.driver.find_elements_by_css_selector("div.flightbox")
            print len(results)
            if len(results)>=__THRESHOLD__: # for testing purposes. Default value: 999
                break
            try:
                self.driver.execute_script("arguments[0].scrollIntoView();", results[0])
                self.driver.execute_script("arguments[0].scrollIntoView();", results[-1])            
            except:
                # use the loop counter so each screenshot gets its own file name
                self.driver.save_screenshot('screen_before_'+str(i)+'.png')
                sleep(2)

                print 'EXCEPTION<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<'
                continue 

            new_driver_html = self.driver.page_source
            if new_driver_html == old_driver_html:
                print 'END OF PAGE'
                break
            old_driver_html = new_driver_html

            wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, 'div.flightbox'), len(results)))
        sleep(10)

To detect when the page is fully loaded, I compare the previous copy of the HTML with the new one. That is probably not what I'm supposed to do, but with Firefox it is sufficient.

Here is a screenshot of PhantomJS when the loading stops: [image]

With Firefox, it loads more and more results, but with PhantomJS it gets stuck at, for example, 10 results.

Any ideas? What are the differences between these two drivers?

Milano

1 Answer

Two key things that helped me to solve it:

  • do not use that custom wait I've helped you with before
  • set window.document.body.scrollTop to 0 first and then, immediately afterwards, to document.body.scrollHeight

Working code:

results = []
while len(results) < 200:
    results = driver.find_elements_by_css_selector("div.flightbox")

    print len(results)

    # scroll
    driver.execute_script("arguments[0].scrollIntoView();", results[0])
    driver.execute_script("window.document.body.scrollTop = 0;")
    driver.execute_script("window.document.body.scrollTop = document.body.scrollHeight;")
    driver.execute_script("arguments[0].scrollIntoView();", results[-1])
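
One caveat with the loop above: if `find_elements_by_css_selector` returns an empty list (for example, before the first results have rendered), `results[0]` raises `IndexError`. A minimal sketch of a guard, assuming the same scroll script as above; the helper name `scroll_results` and the `SCROLL_SCRIPT` constant are hypothetical, not part of Selenium:

```python
# Same four scroll steps as in the loop above, sent as a single script.
SCROLL_SCRIPT = """
    arguments[0].scrollIntoView();
    window.document.body.scrollTop = 0;
    window.document.body.scrollTop = document.body.scrollHeight;
    arguments[1].scrollIntoView();
"""


def scroll_results(driver, results):
    """Scroll from the first to the last result.

    Returns False without touching the driver when `results` is empty,
    because results[0] / results[-1] would raise IndexError in that case.
    """
    if not results:
        return False
    # execute_script passes the extra arguments to the script as arguments[0..n]
    driver.execute_script(SCROLL_SCRIPT, results[0], results[-1])
    return True
```

Inside the loop, a `False` return could be followed by a short `sleep` before looking up the results again.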

Version 2 (endless loop, stop if there is nothing loaded on scroll anymore):

results = []
while True:
    try:
        wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, "div.flightbox"), len(results)))
    except TimeoutException:
        break

    results = self.driver.find_elements_by_css_selector("div.flightbox")
    print len(results)

    # scroll
    for _ in xrange(5):
        try:
            self.driver.execute_script("""
                arguments[0].scrollIntoView();
                window.document.body.scrollTop = 0;
                window.document.body.scrollTop = document.body.scrollHeight;
                arguments[1].scrollIntoView();
            """, results[0], results[-1])
        except StaleElementReferenceException:
            break  # here it means more results were loaded

print "DONE. Result count: %d" % len(results)

Note that I've changed the comparison in the wait_for_more_than_n_elements expected condition. Replaced:

return count >= self.count

with:

return count > self.count
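
The expected condition itself is not shown in this answer. A minimal sketch of what such a class could look like with the strict comparison, assuming it takes a locator tuple and the current element count (the real one from the earlier answer may differ in detail):

```python
class wait_for_more_than_n_elements(object):
    """Expected condition: truthy once MORE than `count` elements match `locator`."""

    def __init__(self, locator, count):
        self.locator = locator  # e.g. (By.CSS_SELECTOR, "div.flightbox")
        self.count = count

    def __call__(self, driver):
        # find_elements(by, value) returns a (possibly empty) list of matches
        count = len(driver.find_elements(*self.locator))
        # Strict comparison: with ">=" the condition would already hold for the
        # current number of results, so the wait would never time out.
        return count > self.count
```

With `>`, `wait.until(...)` only returns when new results have actually appeared and raises `TimeoutException` otherwise, which is what ends the loop in Version 2.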

Version 3 (scrolling from header to footer multiple times):

header = wait.until(EC.visibility_of_element_located((By.TAG_NAME, 'header')))
footer = wait.until(EC.visibility_of_element_located((By.TAG_NAME, 'footer')))

results = []
while True:
    try:
        wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, "div.flightbox"), len(results)))
    except TimeoutException:
        break

    results = self.driver.find_elements_by_css_selector("div.flightbox")
    print len(results)

    # scroll
    for _ in xrange(5):
        self.driver.execute_script("""
            arguments[0].scrollIntoView();
            arguments[1].scrollIntoView();
        """, header, footer)
        sleep(1)
alecxe
  • It doesn't work for me. I've tried to use your code: http://pastebin.com/tHBQu67i ERROR: http://pastebin.com/W8ktFaUR It's something with the last line, but I think results[-1] must exist because it says that there are 10 or 15 results... – Milano Jul 13 '15 at 10:53
  • @Milan okay, which `PhantomJS` version are you using? Try upgrading if it is not the latest. Thanks. – alecxe Jul 13 '15 at 12:44
  • I've checked and it is probably the newest version - 2.0.0 – Milano Jul 13 '15 at 12:57
  • Now I've tried replacing it with 2.0.0, which is probably the same version, and it seemed to work, just printing 5 5 5 5 5 5 10 10 10 10 10 etc. But suddenly this error was raised during a new search: self.load_whole_page(destination, date_arrival, date_return) File "C:\Users\Milano\My Documents\LiClipse Workspace\Pelikan_bot\pelikan.py", line 158, in load_whole_page self.driver.execute_script("arguments[0].scrollIntoView();", results[0]) IndexError: list index out of range – Milano Jul 13 '15 at 13:00
  • And the second problem is that I have to check the last result in results, which means that I can't use this condition (len(results)<200) in the while loop. So I tried comparing the old and new source code of the page there instead, but it stops right at the beginning. – Milano Jul 13 '15 at 13:15
  • @Milan well, correct me if I'm wrong: you basically want to scroll the list of results until there are N results and then get the page source and parse it with `BeautifulSoup`? If yes, then checking the `len(results) to be >= n` as your while loop break should be good enough. – alecxe Jul 13 '15 at 13:18
  • No no, I have to check all results because I need the last (most expensive) price. So the point is that I have to scroll the page to its absolute bottom. This is all I want. Once the page is scrolled to the very bottom, I have the source code of the whole page, and the next steps (parsing the page etc.) are not a problem. – Milano Jul 13 '15 at 13:20
  • A similar error appeared and I don't know why: self.driver.execute_script("arguments[0].scrollIntoView();", results[-1]) It is as if there were no results, but the console shows 10 or sometimes even 15 before this exception appears. – Milano Jul 14 '15 at 09:54
  • Thank you for your help. Now, after changing the version of PhantomJS, it works. I have one more question: how could I change the timeout in the wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, "div.flightbox"), len(results))) call? – Milano Jul 15 '15 at 10:40
  • @Milan initialize a new `WebDriverWait` instance before the loop and use it instead of `wait`. Glad we finally solved it :) – alecxe Jul 15 '15 at 10:42