
There is a web page that I want to run my scraping script on. However, because the page loads additional content as you scroll down, I need to add a function to my script that scrolls the page all the way to the bottom before the scraping runs.

Below is my attempt: the entire script, which seems to stop scrolling at height 5287.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv
import pandas as pd  
   
#Initialize a Chrome browser
driver = webdriver.Chrome("C:.............chromedriver_win32/chromedriver.exe")

#Go to the page we want to scrape
driver.get('https://icodrops.com/category/ended-ico/')

#Open csv file to write in 
csv_file = open('icodrops_ended_icos.csv', 'w')
writer = csv.writer(csv_file)
writer.writerow(['Project_Name', 'Interest', 'Category', 'Received', 'Goal', 'End_Date', 'Ticker'])

page_url = 'https://icodrops.com/category/ended-ico/'
# Although only one page to scrape - need to scroll to the bottom to pull all data
lastHeight = driver.execute_script("return document.documentElement.scrollHeight")
print('lastHeight', lastHeight)
while True: 

    driver.execute_script(f"window.scrollTo(0, {lastHeight});")
    time.sleep(15)
    #height = driver.execute_script("return document.documentElement.scrollHeight")
    newHeight = driver.execute_script("return document.documentElement.scrollHeight")
    print('newHeight', newHeight)
    
    if newHeight == lastHeight:
        break
    lastHeight = newHeight


    try:

        #print the url that we are scraping
        print('Scraping this url:' + page_url)

        #Extract a list object where each element of the list is a row in the table
        rows = driver.find_elements_by_xpath('//div[@class="col-md-12 col-12 a_ico"]') 
        
        # Extract detail in columns from each row
        for row in rows:
            #Initialize a dictionary for each row
            row_dict = {}

            #Use relative xpaths to locate desired data
            project_name = row.find_element_by_xpath('.//div[@class="ico-row"]/div[2]/h3/a').text
            interest = row.find_element_by_xpath('.//div[@class="interest"]').text
            category = row.find_element_by_xpath('.//div[@class="categ_type"]').text
            received = row.find_element_by_xpath('.//div[@id="new_column_categ_invisted"]/span').text
            goal = row.find_element_by_xpath('.//div[@id="categ_desctop"]').text
            end_date = row.find_element_by_xpath('.//div[@class="date"]').text
            ticker = row.find_element_by_xpath('.//div[@id="t_tikcer"]').text


            # Add extracted data to the dictionary
            row_dict['project_name'] = project_name
            row_dict['interest'] = interest
            row_dict['category'] = category
            row_dict['received'] = received
            row_dict['goal'] = goal
            row_dict['end_date'] = end_date
            row_dict['ticker'] = ticker


            writer.writerow(row_dict.values())


    except Exception as e:
        print(e)
        csv_file.close()
        driver.close()
        break

Without being able to scroll to the bottom of the page, my script will only scrape data from the initial page, which constitutes only about 10% of all that is available.

Diop Chopra
  • This sounds like an [X-Y problem](http://xyproblem.info/). Instead of asking for help with your solution to the problem, edit your question and ask about the actual problem. What are you trying to do? – undetected Selenium Nov 22 '21 at 22:49
  • Cheers, note taken - I have updated the question. – Diop Chopra Nov 22 '21 at 22:56
  • Without the URL for this page we can't see what is wrong. – furas Nov 23 '21 at 05:41
  • If you use `print()` to see what you have in the variables, you will see that `scrollTo` doesn't return a value but `None`, so you end up with `newHeight = None` and `lastHeight = None`, and `if newHeight == lastHeight` becomes `if None == None`. – furas Nov 23 '21 at 05:49
  • I tested the code with your URL, and the problem may be that the server detects Selenium and doesn't send new content - JavaScript gets HTML with the message `"This website is using a security service to protect itself from online attacks."` and status `403` - so the browser can't add new data and scroll. So the real problem is different than you expect. – furas Nov 24 '21 at 00:05

2 Answers


If you use print() to inspect the variables, you will see that the scrollTo call returns None, so you can't use its result as newHeight. You have to read the height with a separate script that uses return.


Minimal working code.

I tested it on the page http://quotes.toscrape.com/scroll, which was created for learning scraping.

from selenium import webdriver
import time

url = 'http://quotes.toscrape.com/scroll'

driver = webdriver.Firefox()
driver.get(url)

lastHeight = driver.execute_script("return document.documentElement.scrollHeight")
print('lastHeight', lastHeight)

while True:

    driver.execute_script(f"window.scrollTo(0, {lastHeight});")
    time.sleep(1)    
    newHeight = driver.execute_script("return document.documentElement.scrollHeight")
    print('newHeight', newHeight)
    
    if newHeight == lastHeight:
        break
    
    lastHeight = newHeight
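
The same loop can be wrapped in a reusable function. This is only a sketch: the name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are my own (hypothetical) choices, not part of Selenium. It assumes nothing about `driver` beyond the `execute_script` method, and the round limit is there so the loop cannot run forever on a page that never stops growing.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the document height stops growing; return the final height.

    `driver` is any object with Selenium's `execute_script` method.
    `pause` and `max_rounds` are hypothetical knobs, not Selenium settings.
    """
    height_js = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(height_js)
    for _ in range(max_rounds):
        # Jump to the current bottom, then give the page time to load more.
        driver.execute_script("window.scrollTo(0, arguments[0]);", last_height)
        time.sleep(pause)
        new_height = driver.execute_script(height_js)
        if new_height == last_height:
            break  # nothing new was appended, so this is the bottom
        last_height = new_height
    return last_height
```

You would call `scroll_to_bottom(driver)` once, right after `driver.get(url)`, and start scraping only after it returns.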

BTW:

I found a Stack Overflow answer from 2015 which uses the same method but with document.body instead of document.documentElement:

How can I scroll a web page using selenium webdriver in python?

So if this code works for you, this question could be closed as a duplicate.

furas
  • Thanks but didn't seem to work. I have updated to include my entire code which might make it easier for you to assist – Diop Chopra Nov 23 '21 at 11:54

I always use the snippet below to scroll to the bottom, and I have never seen it fail.

driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;")

So, your effective code will be

lastHeight = driver.execute_script("return document.documentElement.scrollHeight")

while True:
    # Scroll the page's scrolling element to its current bottom
    driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;")
    # Give the page time to load more content
    time.sleep(15)
    # A scroll script returns None, so read the height with a separate "return" script
    newHeight = driver.execute_script("return document.documentElement.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
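
If the fixed 15-second sleep feels slow, one option is to poll the height and move on as soon as it grows. A minimal sketch, assuming the page height increases when new content arrives; the helper name `wait_for_growth` and its `timeout` and `poll` parameters are hypothetical, not a Selenium API.

```python
import time


def wait_for_growth(driver, last_height, timeout=15.0, poll=0.5):
    """Poll the document height until it exceeds `last_height` or `timeout` passes.

    Returns the most recent height read, which equals `last_height`
    if the page did not grow in time.
    """
    height_js = "return document.documentElement.scrollHeight"
    deadline = time.monotonic() + timeout
    while True:
        height = driver.execute_script(height_js)
        # Stop as soon as new content made the page taller, or time ran out.
        if height > last_height or time.monotonic() >= deadline:
            return height
        time.sleep(poll)
```

Inside the loop, `newHeight = wait_for_growth(driver, lastHeight)` would take the place of the sleep and the height read, so the worst case is still 15 seconds per round but the usual case is much shorter.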
cruisepandey
  • Thanks but didn't seem to work. I have updated to include my entire code which might make it easier for you to assist – Diop Chopra Nov 23 '21 at 11:54