I am currently attempting to scrape a Dropbox folder using Selenium in Python. Apparently, if I try to select all hyperlinks (or all elements containing hyperlinks), I only get the first 20 or so results. Here is a minimal working example:
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
page = 'https://www.dropbox.com/FolderName'
browser.get(page)
elementlist = browser.find_elements(By.CLASS_NAME, 'brws-file-name-cell-filename')
# Alternatively, browser.find_elements(By.TAG_NAME, 'a') yields similar results
elength = len(elementlist)
Usually, elength is on the order of 20 to 30 elements, which grows to 30 to 40 if I add a command to scroll down to the bottom of the page. I know for a fact that there are well over 200 elements in the folder I am trying to scrape. My question is thus: is there any way to scroll down the page progressively, rather than going all the way to the bottom right away? Many of the questions asked on the same topic focus on pages with infinite loading, like Facebook or other social media. My page, on the other hand, has a fixed length. Is there a way I can scroll down step by step, rather than all at once?
UPDATE
I tried following the advice given to me by the community and by the answer you can find here. Unfortunately, I am still struggling to iterate over the height, which is my variable of interest and which seems to be stuck in a string. This has been my best attempt at looping over the height, and needless to say, it still did not work.
# Get current height
height = browser.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down
    browser.execute_script('window.scrollTo(0, window.scroll'+str(height)+' + 200)')
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == height:
        break
    else:
        height = new_height
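For reference, here is a minimal sketch of what I think the step-by-step loop above should look like. The string concatenation in my attempt produces something like `window.scroll1234 + 200`, which is not a real property, so the target position never makes sense to the browser. As far as I can tell, `execute_script` already returns the height as a Python number, and extra arguments can be passed to the script through `arguments[0]`, so no string building is needed. The names `scroll_in_steps`, `step`, `pause` and `max_steps` are my own, not part of Selenium, and this only scrolls the main window.

```python
import time


def scroll_in_steps(browser, step=200, pause=0.5, max_steps=1000):
    """Scroll the window down `step` pixels at a time until the bottom.

    `browser` is any object exposing Selenium's execute_script method.
    Returns the number of scroll steps performed.
    """
    position = 0
    steps = 0
    while steps < max_steps:
        # Re-read the height on every pass: lazily loaded rows can extend the page
        height = browser.execute_script("return document.body.scrollHeight")
        if position >= height:
            break
        position += step
        # Pass the number as a script argument instead of gluing it into the string
        browser.execute_script("window.scrollTo(0, arguments[0]);", position)
        # Give the page time to load whatever the scroll revealed
        time.sleep(pause)
        steps += 1
    return steps
```

It would be called as `scroll_in_steps(browser)` after `browser.get(page)`. The `max_steps` cap is only a safety net so the loop cannot run forever if the page keeps growing.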
UPDATE 2
I think I've found the issue. Dropbox basically has a 'page within the page' structure. The whole of the page is visible to me, but there's an inner archive which I need to navigate. Any idea how to do that?
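To make the question concrete, this is a rough sketch of what I imagine scrolling the inner archive would look like, assuming it is an ordinary scrollable element (not an iframe) whose `scrollTop` can be set from JavaScript. I have not confirmed that Dropbox works this way. The names `scroll_container_in_steps`, `container` and `collect` are hypothetical: `container` would be the inner element once I manage to locate it, and `collect` is any function returning the file names currently present in the page. Names are gathered after every step in case the list only keeps the visible rows in the DOM.

```python
import time


def scroll_container_in_steps(browser, container, collect,
                              step=200, pause=0.5, max_steps=1000):
    """Scroll `container` down in steps, calling `collect()` at each position.

    Returns every name seen along the way, in first-seen order.
    """
    seen = {}  # dict used as an ordered set
    position = 0
    for _ in range(max_steps):
        # Record whatever is present before moving on
        for name in collect():
            seen.setdefault(name, None)
        # scrollHeight of the inner element, not of document.body
        height = browser.execute_script(
            "return arguments[0].scrollHeight;", container)
        if position >= height:
            break
        position += step
        # Move the inner element's own scroll position
        browser.execute_script(
            "arguments[0].scrollTop = arguments[1];", container, position)
        time.sleep(pause)
    return list(seen)
```

Here `collect` could be something like `lambda: [e.text for e in browser.find_elements(By.CLASS_NAME, 'brws-file-name-cell-filename')]`. What I am missing is how to find the right `container` element in the first place.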