
I need to scroll down a web page (for example, Twitter) and scrape the new elements that appear as I advance through the page. I am trying to do this with Python 3.x, Selenium and PhantomJS. This is my code:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

user = 'ciroylospersas'
# Start web browser
#browser = webdriver.Firefox()
browser = webdriver.PhantomJS()
browser.set_window_size(1024, 768)
browser.get("https://twitter.com/")

# Fill username in login
element = browser.find_element_by_id("signin-email")
element.clear()
element.send_keys('your twitter user')
# Fill password in login
element = browser.find_element_by_id("signin-password")
element.clear()
element.send_keys('your twitter pass')

browser.save_screenshot('screen.png') # save a screenshot to disk

# Submit the login
element.submit()
time.sleep(5)  # wait for the login to complete

browser.save_screenshot('screen1.png') # save a screenshot to disk
# Move to the following url
browser.get("https://twitter.com/" + user + "/following")
browser.save_screenshot('screen2.png') # save a screenshot to disk

scroll_script = "var h = document.body.scrollHeight; window.scrollTo(0, h); return h;"
newHeight = browser.execute_script(scroll_script)
print(newHeight)
browser.save_screenshot('screen3.png') # save a screenshot to disk

The problem is that I can't scroll to the bottom: screen2.png and screen3.png are identical. But if I change the webdriver from PhantomJS to Firefox, the same code works fine. Why?

F.N.B
  • Can you add a `time.sleep()` after your `scroll_script`? Maybe it needs to render after scroll. –  Nov 01 '16 at 23:48
  • I tried adding a `time.sleep(5)`, but it doesn't work. – F.N.B Nov 02 '16 at 01:14
  • Can you hardcode the height to `10000` and see if it scrolls? Set `scroll_script` as `window.scrollTo(0, 1000)` and nothing else. –  Nov 02 '16 at 01:17
  • If I use Firefox as the driver, it works. But if I use PhantomJS, it doesn't. I need to use PhantomJS because I am going to run this script on a server without a graphical interface. – F.N.B Nov 02 '16 at 13:19

1 Answer


I was able to get this to work in PhantomJS when trying to solve a similar problem:

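# `browser` is the webdriver instance from the question
check_height = browser.execute_script("return document.body.scrollHeight;")
while True:
    # Scroll to the current bottom of the page
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # give the page time to load more content
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break  # height did not change, so nothing new was loaded
    check_height = height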

It will scroll to the current "bottom", wait, see if the page loaded more, and bail if it did not (assuming everything got loaded if the heights match.)

In my original code I had a "max" value I checked alongside the matching heights because I was only interested in the first 10 or so "pages". If there were more I wanted it to stop loading and skip them.

Also, this is the answer I used as an example
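
For illustration, the "max" variant described above could look roughly like the sketch below. This is not my original code: the function name `scroll_until_stable` and the parameters `pause` and `max_pages` are made up for this example, and it assumes only that `driver` has an `execute_script` method, as Selenium webdrivers do.

```python
import time


def scroll_until_stable(driver, pause=5.0, max_pages=10):
    """Scroll to the bottom until the height stops growing or the cap is hit.

    Hypothetical helper: returns how many scrolls loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    pages_loaded = 0
    while pages_loaded < max_pages:
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new items
        height = driver.execute_script("return document.body.scrollHeight;")
        if height == last_height:
            break  # nothing new appeared, assume everything is loaded
        last_height = height
        pages_loaded += 1
    return pages_loaded
```

With the `browser` object from the question, the call would be something like `scroll_until_stable(browser, pause=5, max_pages=10)`. The fixed pause is a guess at how long the page needs; a slow connection may need a longer one.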

  • Great solution. Should be updated at some point though, as two of the 3 execute-statements lack a semicolon at the end. If you arrive here and don't know much about JavaScript, that'll probably take you a while to figure out on your own. – jlaur Aug 07 '17 at 07:36