
I have managed to get Python with Selenium and PhantomJS to keep reloading a dynamically loading, infinite-scrolling page, as in the example below. How can this be modified so that, instead of setting the number of reloads manually, the program stops when it reaches the bottom of the page?

import time
from selenium import webdriver

reloads = 100000  # number of times to scroll and reload
pause = 0  # time interval in seconds between reloads
driver = webdriver.PhantomJS()

# Load Twitter page and click to view all results
driver.get(url)
driver.find_element_by_link_text("All").click()

# Keep reloading and pausing to reach the bottom
for _ in range(reloads):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)

text_file.write(driver.page_source.encode("utf-8"))
text_file.close()
Artjom B.
Simon Lindgren

1 Answer


You can check whether the scroll did anything in every step.

lastHeight = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the current bottom, then give the page time to load more
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    # An unchanged height means nothing new was loaded
    if newHeight == lastHeight:
        break
    lastHeight = newHeight

This uses a static wait, which is bad: you don't want to wait unnecessarily when the load finishes faster, and you don't want the script to exit prematurely when the dynamic load is too slow for some reason.
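One way to soften the static wait while still comparing heights is to poll the height until it changes or a timeout expires. The following is only a minimal sketch of that idea: the function name `scroll_to_bottom` and the `timeout` and `poll` parameters are my own assumptions, not part of Selenium, and it relies only on the driver's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, timeout=10.0, poll=0.25):
    """Scroll until the page height stops growing.

    Returns the number of scrolls that loaded more content.
    `timeout` and `poll` are in seconds (hypothetical defaults).
    """
    height_js = "return document.body.scrollHeight"
    last_height = driver.execute_script(height_js)
    loads = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Poll instead of sleeping a fixed amount: leave as soon as the page grows.
        deadline = time.monotonic() + timeout
        new_height = last_height
        while time.monotonic() < deadline:
            new_height = driver.execute_script(height_js)
            if new_height != last_height:
                break
            time.sleep(poll)
        if new_height == last_height:
            # No growth within `timeout` seconds: assume the bottom was reached.
            return loads
        last_height = new_height
        loads += 1
```

Calling `scroll_to_bottom(driver, timeout=20)` returns soon after each load finishes and gives up only after 20 seconds without any growth. It still cannot tell a very slow load from the real end of the page, so the timeout remains a guess.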

Since a page usually loads some more elements into a list, you can check the length of the list before the load and wait until the next element is loaded.

For Twitter, this could look like this:

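from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

while True:
    # Count the items currently in the list
    elemsCount = browser.execute_script("return document.querySelectorAll('.stream-items > li.stream-item').length")

    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    try:
        # Wait up to 20 seconds for item number elemsCount+1 to appear
        WebDriverWait(browser, 20).until(
            lambda x: x.find_element(By.XPATH,
                "//*[contains(@class,'stream-items')]/li[contains(@class,'stream-item')]["+str(elemsCount+1)+"]"))
    except TimeoutException:
        # No new item arrived in time: assume the bottom was reached
        break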

I used an XPath expression because PhantomJS 1.x sometimes has a bug when using :nth-child() CSS selectors.

Full version for reference.

Artjom B.
  • Also with the Firefox web driver, both heights print as 'None' – Simon Lindgren Mar 08 '15 at 17:36
  • Sorry, forgot the `return` and moved the sleep to the correct position. – Artjom B. Mar 08 '15 at 19:52
  • The script seems to be outdated as gridtimeline-items and grid classes are nonexistent. I have trouble adjusting it, could you point me in the right direction? – Jens de Bruijn May 04 '16 at 14:22
  • @JensdeBruijn If you're still having trouble, I've fixed the scripts now. (Took a little longer than expected) – Artjom B. May 15 '16 at 13:13
  • @yome I can't help you there. I don't use PhantomJS anymore. The code in this answer was tested with PhantomJS 1.9.x. I don't know if it still works in the same way with v2.1.1 and I don't have the time or desire to find out. At least there should not be any change apart from html changes in the twitter page. I've seen your last three questions. The issue I see with all of them is that you say that it doesn't work, but you don't provide any kind of indication that you tried to debug it. How does the page change on loop iteration? Are there any errors? Have you taken screenshots? – Artjom B. Jul 20 '17 at 21:25
  • Counting the elements does not work on Twitter, since it seems to change the HTML as you scroll. I tried it. – munish Jul 28 '21 at 21:38