I'm new to Python (and to programming in general), and I want to use Selenium to scrape data from a web element that updates dynamically as it is scrolled, similar to this post: Trying to use Python and Selenium to scroll and scrape a webpage iteratively. As in the screenshot in that question, my web element is a table of data with headers, which may have a horizontal scroll bar, a vertical one, or both.
The first thing I want to do is scroll across my web element (one column at a time, so as not to skip any columns) and scrape all the headers. So far, I can confirm that I have the correct XPath for the element's horizontal scroll bar, and that I can scroll horizontally across it one column at a time. My current code is below; I adapted it from this question: Python Selenium - Adjust pause_time to scroll down in infinite page:
    scraped_headers = []
    headers = driver.find_elements_by_xpath("//div[@class='gbData']")
    for header in headers:
        if header not in scraped_headers:
            scraped_headers.append(header)
            print(header.text)
    last_header = scraped_headers[-1]
    width_scrollbar = driver.find_element_by_xpath("""/html/body/div[5]/div[2]/div/div/div/div/div[4]/div[5]/div[2]/div[3]""")
    while True:
        driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
        time.sleep(.5)
        new_header = driver.find_elements_by_xpath("//div[@class='gbData']")[-1]
        if new_header.text == last_header.text:
            break
        headers = driver.find_elements_by_xpath("//div[@class='gbData']")
        for header in headers:
            if header not in scraped_headers:
                scraped_headers.append(header)
                last_header = scraped_headers[-1]
                print(header.text)
However, I am observing behavior that I cannot wrap my head around. Printing last_header.text just before this code:
    driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
    time.sleep(.5)
shows the last header that I scraped (as expected, so it matches the print in my first for loop). Printing last_header.text just after that code shows the newest header in the web element, even though, as I understand it, there is no reason for it to have been appended to the list at that point. Consequently, new_header.text equals last_header.text and my while loop breaks.
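One possible explanation, offered as an unconfirmed assumption about this page: the grid may reuse a fixed set of DOM nodes and only swap their contents while scrolling. A Selenium WebElement is a handle to a node, and .text is fetched from the browser on every access, so a stored element would report whatever its node currently displays. The sketch below simulates that effect without Selenium; FakeGrid and FakeCell are hypothetical stand-ins, not part of any library.

```python
class FakeGrid:
    """Hypothetical virtualized table: a fixed number of slots shows a window of columns."""

    def __init__(self, columns, window=3):
        self.columns = list(columns)
        self.window = window
        self.offset = 0

    def scroll_right(self):
        # Shift the visible window by one column, stopping at the right edge.
        self.offset = min(self.offset + 1, max(len(self.columns) - self.window, 0))

    def slot_text(self, slot):
        # Text currently displayed in the given slot.
        return self.columns[self.offset + slot]


class FakeCell:
    """Stand-in for a WebElement: .text is looked up again on every access."""

    def __init__(self, grid, slot):
        self.grid = grid
        self.slot = slot

    @property
    def text(self):
        return self.grid.slot_text(self.slot)


grid = FakeGrid(["A", "B", "C", "D", "E"])
last_header = FakeCell(grid, slot=2)  # handle to the last visible slot

text_before = last_header.text        # live lookup before scrolling
snapshot = last_header.text           # plain string copy, unaffected by later scrolling

grid.scroll_right()                   # the same slot now displays the next column

text_after = last_header.text         # live lookup after scrolling
print(text_before, text_after, snapshot)
```

If the real page behaves like this, it would also explain why "header not in scraped_headers" never finds anything new: as far as I know, Selenium compares WebElement objects by their internal element id, so a reused node counts as already scraped. Storing header.text strings instead of the elements themselves would avoid both effects.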
Interestingly, I can simply do the following:
    scraped_headers = []
    headers = driver.find_elements_by_xpath("//div[@class='gbData']")
    for header in headers:
        if header not in scraped_headers:
            scraped_headers.append(header)
            print(header.text)
    last_header = scraped_headers[-1]
    width_scrollbar = driver.find_element_by_xpath("""/html/body/div[5]/div[2]/div/div/div/div/div[4]/div[5]/div[2]/div[3]""")
    while True:
        driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
        time.sleep(.5)
        print(last_header.text)
With this, my program prints every new header that appears until it just keeps repeating the last one in the list, but I don't know how to break out of the loop!
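For the exit condition, one option is to stop when a scroll step no longer changes the scroll position, and to collect header strings rather than elements. The following is only a sketch of that idea: collect_headers and its three callable parameters are hypothetical names, and the demo at the bottom uses a simulated table, not the real page.

```python
def collect_headers(get_texts, scroll_step, get_position):
    """Collect unique header strings until scrolling stops moving the view.

    All three arguments are hypothetical callables supplied by the caller:
    get_texts() returns the currently visible header strings,
    scroll_step() scrolls one step (and waits if needed),
    get_position() returns the current scroll offset.
    """
    seen = []
    while True:
        for text in get_texts():
            if text not in seen:
                seen.append(text)
        before = get_position()
        scroll_step()
        if get_position() == before:
            # The scroll bar did not move, so the right edge has been reached.
            break
    return seen


# Simulated table for demonstration only: 5 columns, 3 visible at a time.
columns = ["A", "B", "C", "D", "E"]
window = 3
state = {"offset": 0}


def get_texts():
    return columns[state["offset"]:state["offset"] + window]


def scroll_step():
    state["offset"] = min(state["offset"] + 1, len(columns) - window)


def get_position():
    return state["offset"]


headers = collect_headers(get_texts, scroll_step, get_position)
print(headers)

# With Selenium, the callables might look like this (untested against the real page):
# get_texts = lambda: [h.text for h in driver.find_elements_by_xpath("//div[@class='gbData']")]
# scroll_step = lambda: driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
# get_position = lambda: driver.execute_script("return arguments[0].scrollLeft;", width_scrollbar)
```

Note that deduplicating by text would merge two columns that share the same header, so this only fits tables with unique header names.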
What is going on? Am I missing something obvious?
Any help is appreciated!