0

I am trying to scrape a table from a website with Selenium, locating it by XPath.

Previously I tried to look for the table by its class, but no tables were found, so I decided to look for the enclosing div element instead:

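import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

# `driver` is an already created WebDriver with the page loaded.
xpath = "//div[@class='table-scroller ScrollableTable__table-scroller QuoteHistoryTable__table__scroller QuoteHistoryTable__QuoteHistoryTable__table__scroller']"
# Wait until the wrapper div is visible, then hand the rendered HTML to BeautifulSoup.
WebDriverWait(driver, 10).until(
    expected_conditions.visibility_of_element_located((By.XPATH, xpath)))
source = driver.page_source
driver.quit()
soup = BeautifulSoup(source, "html5lib")

table = soup.find('div', {'class': 'table-scroller ScrollableTable__table-scroller QuoteHistoryTable__table__scroller QuoteHistoryTable__QuoteHistoryTable__table__scroller'})
df = pd.read_html(str(table), flavor='html5lib', header=0, thousands='.', decimal=',')
print(df[0])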

The problem is that the printed DataFrame contains only the headers and a single first row full of NaNs.

Why am I not getting the values of the table? What makes this content so hard to scrape?

EDIT: @DebanjanB provided a nice answer, but I am unable to replicate the output. What is the reason for this?

Matthew Daly
JamesHudson81
  • BeautifulSoup(html_source, "html5lib") Anything named html_source? – Arundeep Chohan Sep 19 '20 at 10:59
  • https://stackoverflow.com/questions/63960297/struggling-to-scrap-a-table-using-selenium/63961567#63961567 you had a similar question. Just switch to css selector. – Arundeep Chohan Sep 19 '20 at 11:05
  • @arundeepchohan Thank you for the idea, but after changing the XPath to a CSS selector the issue persists: the DataFrame still contains the same single row of NaNs. I posted a separate question because I think the issue here is different, as I can access the frame but not the values inside it. – JamesHudson81 Sep 19 '20 at 11:19

2 Answers

1

If you inspect the page requests, you might notice an endpoint offering you just the right info as JSON:

https://api.euroinvestor.dk/indices/21/instruments

You can use pandas to read straight from the URL (you don't even need Selenium):

instruments = pd.read_json('https://api.euroinvestor.dk/indices/21/instruments')

Be sure to look at the API usage terms (especially any rate limits); you might get blocked otherwise.
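Since rate limits are the main risk when polling such an endpoint, here is a minimal sketch of a client that keeps a pause between requests. It uses only the standard library. The class name `PoliteJsonClient`, the one-second default interval, and the `User-Agent` value are assumptions made for illustration; none of them come from the API's documentation.

```python
import json
import time
import urllib.request


class PoliteJsonClient:
    """Fetch JSON while keeping a minimum pause between requests (hypothetical helper)."""

    def __init__(self, min_interval=1.0, opener=None,
                 clock=time.monotonic, sleep=time.sleep):
        # min_interval is an assumed value, not a documented limit of the API.
        self.min_interval = min_interval
        self._opener = opener or self._open_url
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    @staticmethod
    def _open_url(url):
        # Some servers reject urllib's default User-Agent, so one is set explicitly.
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8")

    def get(self, url):
        # Sleep only for the part of the interval that has not elapsed yet.
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()
        return json.loads(self._opener(url))
```

Calling `PoliteJsonClient(min_interval=2.0).get('https://api.euroinvestor.dk/indices/21/instruments')` would return the decoded JSON, which can then be passed to `pd.DataFrame(...)` if it turns out to be a list of records.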

danuker
  • Where can I find the API usage terms? – JamesHudson81 Sep 20 '20 at 09:34
  • @JamesHudson81 I suspect it's somewhere at the bottom of https://www.euroinvestor.dk/, but I do not speak Danish: Generelle handelsbetingelser | Cookie-og Privatlivspolitik | Cookiedeklaration | Vilkår – danuker Sep 20 '20 at 10:29
0

To extract the contents of the OMX Stockholm 30 <table> using Selenium, you can use the following Locator Strategy:

  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h4[text()='OMX Stockholm 30']//following::div[2]//table"))).text)
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
  • Console Output:

    VÆRDIPAPIR
    KURS
    ÆNDRING I %
    ÆNDRING
    VOLUME
    BUD
    UDBUD
    OPDATERET
    ABB LTD
    229,60
    0,13%
    0,30
    1.953.199 229,20 229,30 18.09.2020
    ALFA LAVAL AB
    210,50
    1,20%
    2,50
    1.513.953 210,30 210,40 18.09.2020
    ASSA ABLOY AB SER. B
    216,00
    1,55%
    3,30
    3.250.421 216,20 216,40 18.09.2020
    ASTRAZENECA PLC
    995,10
    0,56%
    5,50
    507.005 994,70 995,00 18.09.2020
    ATLAS COPCO AB SER. A
    425,60
    1,89%
    7,90
    2.313.361 425,80 426,10 18.09.2020
    ATLAS COPCO AB SER. B
    376,60
    2,78%
    10,20
    971.096 376,60 376,90 18.09.2020
    AUTOLIV INC. SDB
    655,00
    -1,18%
    -7,80
    279.485 656,80 657,40 18.09.2020
    BOLIDEN AB
    275,80
    1,03%
    2,80
    2.450.311 276,60 276,80 18.09.2020
    ELECTROLUX, AB SER. B
    194,60
    0,34%
    0,65
    1.381.656 195,00 195,10 18.09.2020
    ERICSSON, TELEFONAB. L M SER.
    98,26
    1,30%
    1,26
    17.811.892 98,12 98,16 18.09.2020
    ESSITY AB SER. B
    306,40
    -0,20%
    -0,60
    1.795.692 306,20 306,40 18.09.2020
    GETINGE AB SER. B
    188,10
    1,65%
    3,05
    864.843 188,05 188,15 18.09.2020
    HENNES & MAURITZ AB, H &#3
    157,85
    -1,68%
    -2,70
    5.188.908 157,85 157,90 18.09.2020
    HEXAGON AB SER. B
    677,20
    0,06%
    0,40
    776.831 676,20 676,80 18.09.2020
    INVESTOR AB SER. B
    584,60
    1,53%
    8,80
    1.681.508 585,00 585,20 18.09.2020
    KINNEVIK AB SER. B
    336,95
    3,34%
    10,90
    1.118.689 336,35 336,55 18.09.2020
    NORDEA BANK ABP
    68,37
    -1,85%
    -1,29
    11.846.193 68,45 68,48 18.09.2020
    SANDVIK AB
    185,10
    1,54%
    2,80
    3.874.524 185,00 185,10 18.09.2020
    SECURITAS AB SER. B
    140,00
    -0,53%
    -0,75
    1.545.060 140,20 140,35 18.09.2020
    SKANDINAVISKA ENSKILDA BANKEN
    81,38
    -3,46%
    -2,92
    10.968.672 81,38 81,42 18.09.2020
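
The printed text uses Danish number formatting ('.' for thousands, ',' for decimals) and spreads each instrument over five lines. Below is a rough sketch of how that text could be turned into records, assuming the layout is exactly as shown in the console output above (eight header lines, then five lines per row). The function names and the English dictionary keys are made up for this example.

```python
from decimal import Decimal

HEADER_LINES = 8   # VÆRDIPAPIR ... OPDATERET, one label per line
LINES_PER_ROW = 5  # name, price, change %, change, then "volume bid ask date"


def parse_danish_number(text):
    """Convert '1.953.199' or '-1,18%' to Decimal ('.' thousands, ',' decimal)."""
    cleaned = text.strip().rstrip("%").replace(".", "").replace(",", ".")
    return Decimal(cleaned)


def parse_table_text(text):
    """Turn the element's .text into a list of dicts, one per instrument."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    body = lines[HEADER_LINES:]
    records = []
    # Step through the body in chunks of five lines; an incomplete tail is ignored.
    for start in range(0, len(body) - LINES_PER_ROW + 1, LINES_PER_ROW):
        name, price, change_pct, change, tail = body[start:start + LINES_PER_ROW]
        volume, bid, ask, updated = tail.split()
        records.append({
            "name": name,
            "price": parse_danish_number(price),
            "change_pct": parse_danish_number(change_pct),
            "change": parse_danish_number(change),
            "volume": parse_danish_number(volume),
            "bid": parse_danish_number(bid),
            "ask": parse_danish_number(ask),
            "updated": updated,  # left as text, e.g. '18.09.2020'
        })
    return records
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` directly. If the site changes the column order or the line layout, the two constants and the unpacking would need adjusting.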
    

Update

As you mentioned in your comments, you are "...either getting a timeout or I am only able to get the headers...". That effectively implies our locators are correct and the issue is with rendering. In that case you can scrollIntoView() the heading first and then read the table, using the following solution:

driver.get('https://www.euroinvestor.dk/markeder/aktier/sverige/omx-stockholm-30/21')
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h4[text()='OMX Stockholm 30']"))))
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h4[text()='OMX Stockholm 30']//following::div[2]//table"))).text)
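
If scrolling alone is not enough, another option is to wait until the table actually contains rows before reading it. The following is only a sketch of a custom wait condition; `rows_present` is a hypothetical helper, not part of Selenium, and it relies solely on the driver's `find_elements` method.

```python
def rows_present(locator, min_rows=1):
    """Build a wait condition that passes once `locator` matches enough elements.

    `locator` is a (by, value) tuple, the same shape that Selenium's built-in
    expected conditions accept. This helper is an illustration, not a Selenium API.
    """
    def condition(driver):
        rows = driver.find_elements(*locator)
        # WebDriverWait.until keeps polling while the returned value is falsy.
        return rows if len(rows) >= min_rows else False
    return condition
```

It could be used as `WebDriverWait(driver, 20).until(rows_present((By.XPATH, "//h4[text()='OMX Stockholm 30']//following::div[2]//table//tr"), min_rows=2))`, where the trailing `//tr` and the threshold of two rows (header plus at least one data row) are assumptions about the page's markup.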
undetected Selenium
  • Thank you for the response, however I am unable to replicate the output. I am either getting a timeout or only the headers. I also do not understand the path reference stated; is this also Angular? – JamesHudson81 Sep 20 '20 at 07:17
  • @JamesHudson81 Check out the answer update and let me know the status. – undetected Selenium Sep 20 '20 at 18:25