I'm new to Python (and, well, programming in general), and I want to scrape data from a webelement that dynamically updates after scrolling, using Selenium, similar to this post: Trying to use Python and Selenium to scroll and scrape a webpage iteratively. As in the screenshot in that question, my webelement is a table of data with headers, which may have a horizontal scroll bar, a vertical one, or both.

The first thing I want to do is scroll across my webelement (one column at a time, so as not to skip any columns) and scrape all the headers. So far, I can confirm that I have the correct XPath for my webelement's horizontal scroll bar, and that I am able to scroll horizontally across the webelement one column at a time. Below is my code as it stands, adapted from this question: Python Selenium - Adjust pause_time to scroll down in infinite page:

import time  # needed for time.sleep below

scraped_headers = []
headers = driver.find_elements_by_xpath("//div[@class='gbData']")
for header in headers:
   if header not in scraped_headers:
      scraped_headers.append(header)
      print(header.text)
last_header = scraped_headers[-1]

width_scrollbar = driver.find_element_by_xpath("""/html/body/div[5]/div[2]/div/div/div/div/div[4]/div[5]/div[2]/div[3]""")

while True:
   driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
   time.sleep(.5)
   new_header = driver.find_elements_by_xpath("//div[@class='gbData']")[-1]
   if new_header.text == last_header.text:
      break
   headers = driver.find_elements_by_xpath("//div[@class='gbData']")
   for header in headers:
      if header not in scraped_headers:
         scraped_headers.append(header)
         last_header = scraped_headers[-1]
         print(header.text)

However, I am observing an unexpected behavior that I cannot seem to wrap my head around. A print() of last_header.text just prior to this code:

   driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
   time.sleep(.5)

will show the last header that I scraped (as expected, and therefore matching the print in my first for loop). A print() of last_header.text just after that code, however, will show the latest header in the webelement, even though nothing has been appended to the list at that point (as I understand it). Consequently, new_header.text will equal last_header.text and my while loop will break immediately.
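This live-reference behavior can be illustrated without Selenium at all: if you store a reference to a mutable object, later attribute reads reflect the object's current state, whereas storing the string value takes a snapshot. A minimal pure-Python sketch (the `Cell` class here is just a stand-in for a WebElement, not a Selenium type):

```python
# Stand-in for a WebElement: its .text reflects the page's current state.
class Cell:
    def __init__(self, text):
        self.text = text

cells = [Cell("Jan"), Cell("Feb")]

# Store a *reference* to the last cell (like appending the WebElement itself).
last_ref = cells[-1]

# Store its *value* (like appending header.text instead).
last_value = cells[-1].text

# The page updates in place after scrolling: same divs, new text.
cells[-1].text = "Mar"

print(last_ref.text)   # "Mar" - the reference tracks the live object
print(last_value)      # "Feb" - the string snapshot is unchanged
```

This is why `last_header.text` appears to "update itself" after the scroll: the scroll changed the text inside the same div that `last_header` points at.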

Interestingly, I can seem to just do the following:

scraped_headers = []
headers = driver.find_elements_by_xpath("//div[@class='gbData']")
for header in headers:
   if header not in scraped_headers:
      scraped_headers.append(header)
      print(header.text)
last_header = scraped_headers[-1]

width_scrollbar = driver.find_element_by_xpath("""/html/body/div[5]/div[2]/div/div/div/div/div[4]/div[5]/div[2]/div[3]""")

while True:
   driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
   time.sleep(.5)
   print(last_header.text)

and my program will print every new header as it appears, until it just repeats the last one in the list; but then I have no way to break out of the loop!

What is going on? Am I missing something obvious?

Any help is appreciated!

  • the driver is returning a reference to the WebElement, not the element itself. So you've got a list of references to the WebElements on the page at that time. – pcalkins Jan 03 '20 at 23:07
  • so basically the number of divs with class of 'gbData' is not changing (?), but the values are. Once .text is called you get at the actual element in that position at that time. OR the size is changing but the reference stays the same... depends how many elements are found before and after... I'm sort of surprised you don't get a staleElement exception. – pcalkins Jan 03 '20 at 23:25
  • @pcalkins, you are correct, and your comments actually answer my question. I've adjusted my code to append the _text_ contained within the divs with class 'gbData' rather than the reference to the webelement. Scrolling apparently does not alter the number of div's, merely the text contained within them. Thank you! – Charles Waller Jan 06 '20 at 15:34

1 Answer

As @pcalkins points out in the comments, appending the .text of each header rather than a reference to the header webelement solves my problem. This adjusted code accomplishes what I want nicely:

import time  # needed for time.sleep below

scraped_headers = []
headers = driver.find_elements_by_xpath("//div[@class='gbData']")
for header in headers:
   if header.text not in scraped_headers:
      scraped_headers.append(header.text)
      print(header.text)
last_header = scraped_headers[-1]

width_scrollbar = driver.find_element_by_xpath("""/html/body/div[5]/div[2]/div/div/div/div/div[4]/div[5]/div[2]/div[3]""")

while True:
   driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
   time.sleep(.5)
   new_header = driver.find_elements_by_xpath("//div[@class='gbData']")[-1]
   if new_header.text == last_header:
      break
   headers = driver.find_elements_by_xpath("//div[@class='gbData']")
   for header in headers:
      if header.text not in scraped_headers:
         scraped_headers.append(header.text)
         last_header = scraped_headers[-1]
         print(header.text)
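For what it's worth, the dedup-and-stop logic above can be factored into a small helper that works on plain strings, which makes the stopping condition easy to reason about (and to test without a browser). This is only an illustrative refactoring, not part of the original answer; `collect_new` is a hypothetical name:

```python
def collect_new(scraped, texts):
    """Append any texts not yet seen; return True if anything new was found."""
    progressed = False
    for t in texts:
        if t not in scraped:
            scraped.append(t)
            progressed = True
    return progressed

# Sketch of how it would slot into the scroll loop:
# while True:
#     driver.execute_script("arguments[0].scrollLeft += 50;", width_scrollbar)
#     time.sleep(.5)
#     texts = [h.text for h in driver.find_elements_by_xpath("//div[@class='gbData']")]
#     if not collect_new(scraped_headers, texts):
#         break  # no new headers appeared, so we've reached the end
```

The loop then terminates as soon as a scroll step surfaces no unseen header text, without needing a separate `last_header` variable.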