A tough scraping case using Selenium

Question

I am trying to scrape a table from a website using Selenium, locating it with XPath.

I previously tried to look for the table by its class, but no tables were found, so I decided to look for the enclosing div element instead:

import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

# driver is assumed to be a WebDriver that has already loaded the page
xpath = "//div[@class='table-scroller ScrollableTable__table-scroller QuoteHistoryTable__table__scroller QuoteHistoryTable__QuoteHistoryTable__table__scroller']"
WebDriverWait(driver, 10).until(
    expected_conditions.visibility_of_element_located((By.XPATH, xpath)))
source = driver.page_source
driver.quit()
soup = BeautifulSoup(source, "html5lib")

table = soup.find('div', {'class': 'table-scroller ScrollableTable__table-scroller QuoteHistoryTable__table__scroller QuoteHistoryTable__QuoteHistoryTable__table__scroller'})
df = pd.read_html(str(table), flavor='html5lib', header=0, thousands='.', decimal=',')
print(df[0])

The issue I am having is that only the headers and a first row of values, all NaN, are printed.

Why am I not getting the values of the table? What makes this content so tough to scrape?

EDIT: @DebanjanB provided a nice answer, but I am unable to replicate the output. What is the reason behind this?

BeautifulSoup(html_source, "html5lib") Anything named html_source? — Arundeep Chohan, Sep 19 '20 at 10:59
You had a similar question: https://stackoverflow.com/questions/63960297/struggling-to-scrap-a-table-using-selenium/63961567#63961567 Just switch to a CSS selector. — Arundeep Chohan, Sep 19 '20 at 11:05
@arundeepchohan Thank you for the idea, but after changing the XPath to a CSS selector the issue persists: the df is still generated with the same row of NaNs. I decided to post a separate question because I think the issue is different in this case: I can access the frame but not the values inside it. — JamesHudson81, Sep 19 '20 at 11:19

score 1 · Accepted Answer · answered Sep 20 '20 at 08:47

If you inspect the page requests, you might notice an endpoint offering you just the right info as JSON:

https://api.euroinvestor.dk/indices/21/instruments

You can use pandas to read straight from the URL (you don't even need Selenium):

instruments = pd.read_json('https://api.euroinvestor.dk/indices/21/instruments')

Be sure to look at the API usage terms (especially any rate limits); you might get blocked otherwise.
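
If you end up polling the endpoint repeatedly, a small throttle can help you stay under whatever rate limit applies. The following is only a minimal sketch using the standard library: the class name ThrottledFetcher, the one-second default interval, and the 10-second timeout are illustrative assumptions, not values taken from the site's terms.

```python
import json
import time
import urllib.request


class ThrottledFetcher:
    """Fetch JSON while keeping a minimum pause between requests.

    Hypothetical helper: the interval is a guess, not a documented limit.
    """

    def __init__(self, min_interval=1.0, opener=urllib.request.urlopen,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._opener = opener
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def get_json(self, url):
        # Sleep only if the previous request was made too recently.
        if self._last_request is not None:
            remaining = self.min_interval - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()
        with self._opener(url, timeout=10) as response:
            return json.loads(response.read().decode("utf-8"))


# Usage sketch (performs a real network request, so it is left commented out):
# fetcher = ThrottledFetcher(min_interval=2.0)
# data = fetcher.get_json("https://api.euroinvestor.dk/indices/21/instruments")
```

The opener, clock, and sleep functions are injectable only so the throttling logic can be exercised without touching the network.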

danuker


Where can I find the API usage terms? – JamesHudson81 Sep 20 '20 at 09:34
@JamesHudson81 I suspect it's somewhere at the bottom of https://www.euroinvestor.dk/, but I do not speak Danish: Generelle handelsbetingelser | Cookie-og Privatlivspolitik | Cookiedeklaration | Vilkår – danuker Sep 20 '20 at 10:29

undetected Selenium · Answer 2 · 2020-09-20T18:24:50.357

To extract the contents of the OMX Stockholm 30 <table> using Selenium and Python, you can use the following locator strategy:

Using XPATH:

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h4[text()='OMX Stockholm 30']//following::div[2]//table"))).text)

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Console Output:

VÆRDIPAPIR
KURS
ÆNDRING I %
ÆNDRING
VOLUME
BUD
UDBUD
OPDATERET
ABB LTD
229,60
0,13%
0,30
1.953.199 229,20 229,30 18.09.2020
ALFA LAVAL AB
210,50
1,20%
2,50
1.513.953 210,30 210,40 18.09.2020
ASSA ABLOY AB SER. B
216,00
1,55%
3,30
3.250.421 216,20 216,40 18.09.2020
ASTRAZENECA PLC
995,10
0,56%
5,50
507.005 994,70 995,00 18.09.2020
ATLAS COPCO AB SER. A
425,60
1,89%
7,90
2.313.361 425,80 426,10 18.09.2020
ATLAS COPCO AB SER. B
376,60
2,78%
10,20
971.096 376,60 376,90 18.09.2020
AUTOLIV INC. SDB
655,00
-1,18%
-7,80
279.485 656,80 657,40 18.09.2020
BOLIDEN AB
275,80
1,03%
2,80
2.450.311 276,60 276,80 18.09.2020
ELECTROLUX, AB SER. B
194,60
0,34%
0,65
1.381.656 195,00 195,10 18.09.2020
ERICSSON, TELEFONAB. L M SER.
98,26
1,30%
1,26
17.811.892 98,12 98,16 18.09.2020
ESSITY AB SER. B
306,40
-0,20%
-0,60
1.795.692 306,20 306,40 18.09.2020
GETINGE AB SER. B
188,10
1,65%
3,05
864.843 188,05 188,15 18.09.2020
HENNES & MAURITZ AB, H &#3
157,85
-1,68%
-2,70
5.188.908 157,85 157,90 18.09.2020
HEXAGON AB SER. B
677,20
0,06%
0,40
776.831 676,20 676,80 18.09.2020
INVESTOR AB SER. B
584,60
1,53%
8,80
1.681.508 585,00 585,20 18.09.2020
KINNEVIK AB SER. B
336,95
3,34%
10,90
1.118.689 336,35 336,55 18.09.2020
NORDEA BANK ABP
68,37
-1,85%
-1,29
11.846.193 68,45 68,48 18.09.2020
SANDVIK AB
185,10
1,54%
2,80
3.874.524 185,00 185,10 18.09.2020
SECURITAS AB SER. B
140,00
-0,53%
-0,75
1.545.060 140,20 140,35 18.09.2020
SKANDINAVISKA ENSKILDA BANKEN
81,38
-3,46%
-2,92
10.968.672 81,38 81,42 18.09.2020
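
The .text output above is a flat string, so it still needs to be split into rows before it can be used as data. The sketch below assumes the layout shown in the console output: eight header lines, then five lines per instrument, with the last line holding volume, bid, ask, and date separated by spaces. The function names and the English dictionary keys are hypothetical, and the number parsing assumes the Danish format ('.' for thousands, ',' for decimals).

```python
def parse_danish_number(value):
    """Convert strings such as '1.953.199', '229,60' or '-1,18%' to float."""
    cleaned = value.replace("%", "").replace(".", "").replace(",", ".")
    return float(cleaned)


def parse_table_text(text, header_lines=8):
    """Group the flat text of the table into one dict per instrument."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    body = lines[header_lines:]  # skip the column headers
    rows = []
    # Each instrument occupies five consecutive lines in the output.
    for i in range(0, len(body) - 4, 5):
        name, price, change_pct, change, tail = body[i:i + 5]
        volume, bid, ask, updated = tail.split()
        rows.append({
            "name": name,
            "price": parse_danish_number(price),
            "change_pct": parse_danish_number(change_pct),
            "change": parse_danish_number(change),
            "volume": int(volume.replace(".", "")),
            "bid": parse_danish_number(bid),
            "ask": parse_danish_number(ask),
            "updated": updated,  # kept as the original dd.mm.yyyy string
        })
    return rows
```

The resulting list of dicts can be passed directly to pd.DataFrame(rows) if a DataFrame is needed. If the site changes the number of columns, the five-line grouping would have to be adjusted.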

Update

As you mentioned in your comments, you are "...either getting a timeout or I am only able to get the headers...". That effectively implies that the locators are correct and the issue lies with rendering. In that case you can call scrollIntoView() on the heading first, using the following solution:

driver.get('https://www.euroinvestor.dk/markeder/aktier/sverige/omx-stockholm-30/21')
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h4[text()='OMX Stockholm 30']"))))
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h4[text()='OMX Stockholm 30']//following::div[2]//table"))).text)
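
If only the headers come back, one possible explanation is that the table element becomes visible before its rows are rendered. A hedged alternative is to wait for the rows themselves. WebDriverWait.until accepts any callable that takes the driver, so a custom condition can be sketched as follows; the helper name at_least_n_elements and the tbody/tr suffix on the XPath are assumptions about the page, not something confirmed by the source.

```python
def at_least_n_elements(locator, minimum=1):
    """Build a wait condition that passes once `minimum` elements match."""
    def _condition(driver):
        elements = driver.find_elements(*locator)
        # WebDriverWait keeps polling while the condition returns a falsy value.
        return elements if len(elements) >= minimum else False
    return _condition


# Usage sketch (needs a live browser session, so it is left commented out):
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# row_locator = (By.XPATH,  # the tbody/tr part is an assumption about the markup
#                "//h4[text()='OMX Stockholm 30']//following::div[2]//table//tbody/tr")
# rows = WebDriverWait(driver, 20).until(at_least_n_elements(row_locator, minimum=2))
# for row in rows:
#     print(row.text)
```

Returning the matched elements instead of True means the result of until(...) can be used directly.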

Thank you for the response, however I am unable to replicate the output: I either get a timeout or I only get the headers. I also do not understand the path reference stated. Is this also Angular? — JamesHudson81, Sep 20 '20 at 07:17
@JamesHudson81 Check out the answer update and let me know the status. — undetected Selenium, Sep 20 '20 at 18:25
