
I'm trying to scrape a 221x7 table with Selenium. Since my first approach takes approx. 3 seconds, I was wondering what the fastest way is and what best practice looks like at the same time.

1st: 3.6sec

# Single round-trip: grab the whole <tbody> once and split its rendered text into lines
table_content = driver_lsx_watchlist.find_element(By.XPATH, '//*[@id="page_content"]/div/div/div/div/module/div/table/tbody')
table_content = table_content.text.splitlines()
for i, line in enumerate(table_content):
    print(f'{i} {line}')

2nd: about 200sec!!!

# One WebDriver round-trip per cell (221 rows x 7 columns) is what makes this so slow
for row in range(1, 222):
    row_text = ''
    for column in range(1, 8):  # columns 1-7; range(1, 7) would stop at column 6
        xpath = f'//*[@id="page_content"]/div/div/div/div/module/div/table/tbody/tr[{row}]/td[{column}]/div'
        row_text += driver_lsx_watchlist.find_element(By.XPATH, xpath).text
    print(row_text)

3rd: a bit over 4sec

print(driver_lsx_watchlist.find_element(By.XPATH, "/html/body").text)

4th: 0.2sec

# Select the whole page and copy it to the clipboard (Ctrl+A, then Ctrl+C)
ActionChains(driver_lsx_watchlist)\
    .key_down(Keys.CONTROL)\
    .send_keys("a")\
    .key_up(Keys.CONTROL)\
    .key_down(Keys.CONTROL)\
    .send_keys("c")\
    .key_up(Keys.CONTROL)\
    .perform()

Since the clipboard approach seems to be the fastest of all, but renders my PC unusable while it runs because the clipboard itself is occupied by the process, I wonder what the best practice would be and whether I can get a proper solution that runs in under one second while I keep using the very same PC.
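
For what it's worth, reading the copied text back into the script could look like the sketch below. It assumes the third-party pyperclip package is installed and that it runs right after the Ctrl+A/Ctrl+C chain above; it does not solve the underlying problem that the clipboard is shared with the rest of the desktop.

# Sketch only: read the Ctrl+C result back into the script.
# Assumes the third-party "pyperclip" package (pip install pyperclip)
# and that the ActionChains copy above has already run.
import pyperclip

table_text = pyperclip.paste()  # whatever Ctrl+C placed on the clipboard
for i, line in enumerate(table_text.splitlines()):
    print(f'{i} {line}')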

  • can you confirm the url where this table resides? – Barry the Platipus Jul 28 '22 at 20:34
  • It is a dynamic table. The content I get via print() is also as expected. The URL is https://www.ls-x.de/de/watchlist. The table will have length 0 for you unless you add items to your watch list. I can copy/paste some sample code from my watchlist's source if it helps. – ngn16920 Jul 28 '22 at 20:37
  • @ngn16920 Unless you add items to your watch list, it appears empty – undetected Selenium Jul 28 '22 at 20:46

1 Answer


To scrape the table within the webpage you need to induce WebDriverWait for the visibility_of_element_located() of the <table> element, and then, reading it into a DataFrame with Pandas, you can use the following locator strategy:

driver.execute("get", {'url': 'https://www.ls-x.de/de/watchlist'})
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.accept"))).click()
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='page_content']/div/div/div/div/module/div/table"))).get_attribute("outerHTML")
df  = pd.read_html(data)
print(df)

Note: You have to add the following imports:

import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
  • Thanks a lot. This works like a charm. Still, why is the "Selenium only" approach so much slower than adding Pandas to the fray? – ngn16920 Aug 02 '22 at 19:04
  • @ngn16920 Using Selenium you are dealing with the fully [rendered](https://stackoverflow.com/a/47237075/7429447) HTML [DOM](https://www.w3schools.com/js/js_htmldom.asp). Hence the _`visibility_of_element_located()`_; it is part of functional testing. But from a performance perspective (data extraction only), Selenium may not be the best fit; the Requests module will fit better. – undetected Selenium Aug 02 '22 at 19:15
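
A minimal sketch of that requests-based suggestion, assuming the table is present in the raw HTML response at all; for a login-gated, JavaScript-rendered watch list this may not hold, in which case Selenium remains necessary:

# Sketch of the requests-based suggestion above. Assumptions: the
# third-party "requests" package is installed and the table is present
# in the raw HTML; a per-user watch list rendered by JavaScript would
# still require Selenium (or an authenticated session).
import requests
import pandas as pd

resp = requests.get('https://www.ls-x.de/de/watchlist', timeout=10)
resp.raise_for_status()

try:
    tables = pd.read_html(resp.text)  # parses every <table> in the document
    print(tables[0])
except ValueError:
    print('No table found - the page probably renders it via JavaScript.')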