I have been doing some research, and so far the Python package I plan on using is Scrapy. Now I am trying to find out a good way to build a scraper with Scrapy that can crawl a site with infinite scrolling. After digging around, I found a package called Selenium that has a Python module. I have a feeling someone has already combined Scrapy and Selenium to scrape sites with infinite scrolling, and it would be great if someone could point me toward an example.
-
A way to do that is to trigger some down arrow keys to make your browser scroll down. – donfuxx Mar 28 '14 at 01:01
-
Take a look: http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page – alecxe Mar 28 '14 at 01:03
5 Answers
You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.
Step 1: Install Selenium using pip
pip install selenium
Step 2: Use the code below to automate the infinite scroll and extract the source code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys
import unittest, time, re

class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        driver.find_element_by_link_text("All").click()
        for i in range(1, 100):
            # scroll to the bottom to trigger loading of the next batch of results
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')

if __name__ == "__main__":
    unittest.main()
The for loop lets you page through the infinite scroll, after which you can extract the loaded data from the page source.
Step 3: Print the data if required.
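For example, a minimal sketch of that last step, assuming BeautifulSoup is installed; the p.tweet-text selector below is only an assumption about Twitter's markup at the time, so inspect the page and adjust it:

from bs4 import BeautifulSoup

# parse the page source collected above
soup = BeautifulSoup(html_source, "html.parser")
# "p.tweet-text" is an assumed selector, not guaranteed by the site
for tweet in soup.select("p.tweet-text"):
    print(tweet.get_text(strip=True))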

This is short and simple code that works for me:
import time
from selenium import webdriver

driver = webdriver.Firefox()       # or any other driver
driver.get("https://example.com")  # replace with the page you want to scrape

SCROLL_PAUSE_TIME = 20

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

posts = driver.find_elements_by_class_name("post-text")
for block in posts:
    print(block.text)
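Some pages unload earlier items from the DOM as you scroll, in which case only the last batch survives to the final query. A hedged variant of the same loop that collects text on every pass instead, assuming the driver setup and post-text class from above:

seen = set()
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # harvest whatever is currently in the DOM before it gets unloaded
    for el in driver.find_elements_by_class_name("post-text"):
        seen.add(el.text)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

for text in seen:
    print(text)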

-
It would be helpful to add all needed includes and definitions (e.g. of `driver`) to your script – bixel Feb 21 '20 at 10:46
-
I am using this code, but it returns just the last elements of the scroll, not every element on the page – Sht Apr 03 '21 at 20:34
from selenium.webdriver.common.keys import Keys
import selenium.webdriver
driver = selenium.webdriver.Firefox()
driver.get("http://www.something.com")
lastElement = driver.find_elements_by_id("someId")[-1]
lastElement.send_keys(Keys.NULL)
This will open a page, find the bottom-most element with the given id, and then scroll that element into view. You'll have to keep querying the driver to get the last element as the page loads more, and I've found this to be pretty slow as pages get large. The time is dominated by the call to driver.find_element_*, because I don't know of a way to explicitly query the last element on the page. Through experimentation you might find there is an upper limit to the number of elements the page loads dynamically, and it would be best if you wrote something that loaded that number and only then made a call to driver.find_element_*.
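A minimal sketch of that keep-querying loop, reusing the driver and Keys imports from the snippet above (someId is the answer's placeholder, not a real id):

import time

# keep nudging the last element into view until the page stops growing
prev_count = 0
while True:
    elements = driver.find_elements_by_id("someId")
    if len(elements) == prev_count:
        break                              # nothing new loaded; we hit the bottom
    prev_count = len(elements)
    elements[-1].send_keys(Keys.NULL)      # scrolls the last element into view
    time.sleep(2)                          # give the page a moment to load more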

For infinite scrolling, the data is fetched through Ajax calls. Open your web browser's network tab and clear the previous request history (click the stop-like icon), then scroll the webpage. You will now see a new request fired by the scroll event. Open its request headers and you will find the URL of the request. Copy and paste that URL into a separate tab and you will see the result of the Ajax call. Then just construct that request URL yourself and fetch the data page by page until the end.
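As a hedged illustration of replaying such a request in Python: the endpoint, the cursor parameter, and the response shape below are all made-up placeholders, so substitute whatever your network tab actually shows:

import requests

# hypothetical endpoint and pagination parameter copied from the network tab
url = "https://example.com/api/feed"
cursor = None
while True:
    params = {"cursor": cursor} if cursor else {}
    resp = requests.get(url, params=params)
    resp.raise_for_status()
    payload = resp.json()
    for item in payload.get("items", []):    # assumed response shape
        print(item)
    cursor = payload.get("next_cursor")      # assumed pagination token
    if not cursor:
        break                                # no more pages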

-
Agreed; in my experience, web page automation is never the optimal way to implement a crawler. – Kamoo Oct 23 '19 at 06:37
Great Question!
The Challenge
When working with an infinite-scroll page (or any dynamically loading site), there's no way to really know how long new items will take to load, so it is hard to know how long to wait before hitting page-down again.
Additionally, even if we solve the first problem, we still want to make sure we hit page-down enough times to actually reach the bottom of the page.
TL;DR: if the site is slow, or the data takes a while to load for whatever reason, we don't want to exit too early.
My Solution
- First, define a scroll_down function which takes an element and a positive integer n as input.
- The function contains a for-loop which hits page-down n times, waiting .01 seconds (this can be changed) between page-downs.
- Store the element's current scroll height in a variable named prev_height.
- Within a for-loop, use the scroll_down function to scroll down.
- Within each iteration, take a significant pause to allow more items to load (I waited 10 seconds).
- After the pause, compare prev_height with the current height. If they are the same, exit; otherwise continue.
Code
Scroll function:
def scroll_down(elem, num):
    for _ in range(num):
        time.sleep(.01)
        elem.send_keys(Keys.PAGE_DOWN)
Main code:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()   # load whichever driver you need, then driver.get(...) your target page
SCROLL_PAUSE_TIME = 10

elem = driver.find_element_by_tag_name("body")
prev_height = elem.get_attribute("scrollHeight")

for i in range(0, 500):
    # note that the pause between page-downs is only .01 seconds,
    # so in this case that is a total of 1 second of waiting time
    scroll_down(elem, 100)

    # wait to allow new items to load
    time.sleep(SCROLL_PAUSE_TIME)

    # check to see if the scrollable space got larger;
    # we also wait until the second iteration to give time for the initial load
    if elem.get_attribute("scrollHeight") == prev_height and i > 0:
        break
    prev_height = elem.get_attribute("scrollHeight")
Note: the actual numbers I used in my program may not work for you, but I believe the approach itself is reliable. That said, while it has been quite dependable for me, it does take time.
