I have just started learning web scraping and am trying to extract data from the 'Holdings' table at https://www.ishares.com/us/products/268752/ishares-global-reit-etf

First I used pandas, but it returned an empty DataFrame. I later found out that the table is dynamic and that I need to use Selenium, but that also returns an empty DataFrame. Could anyone help me with this, please? I would really appreciate it.

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Instantiate options
options = webdriver.ChromeOptions()
options.headless = True

# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver',options=options)
wd.get(site)

# Load the HTML page
html = wd.page_source

# Extract data with pandas
df = pd.read_html(html)
table = df[6]
chitown88
Mango

1 Answer

To extract the data from the Holdings table on the iShares Global REIT ETF page, induce WebDriverWait for visibility_of_element_located() so that the table has rendered before you read it, then pass the table's outerHTML to pandas.read_html(). You can use the following locator strategy:

Code Block:

from io import StringIO

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Start Chrome; chromedriver must be on PATH (or be managed by Selenium 4.6+)
options = Options()
options.add_argument("--start-maximized")
wd = webdriver.Chrome(options=options)

wd.get("https://www.ishares.com/us/products/268752/ishares-global-reit-etf")
# Accept the cookie banner so it does not cover the page
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
# Scroll the Holdings heading into view
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
# Wait until the table is visible, then grab its HTML
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
# read_html returns a list of DataFrames, one per <table> found
df = pd.read_html(StringIO(data))
# df = pd.read_html(StringIO(data), flavor='html5lib')
print(df)
wd.quit()

Console Output:

[  Ticker                                Name       Sector Asset Class  ...      CUSIP          ISIN    SEDOL  Accrual Date
0    PLD                   PROLOGIS REIT INC  Real Estate      Equity  ...  74340W103  US74340W1036  B44WZD7             -
1   EQIX                    EQUINIX REIT INC  Real Estate      Equity  ...  29444U700  US29444U7000  BVLZX12             -
2    PSA                 PUBLIC STORAGE REIT  Real Estate      Equity  ...  74460D109  US74460D1090  2852533             -
3    SPG       SIMON PROPERTY GROUP REIT INC  Real Estate      Equity  ...  828806109  US8288061091  2812452             -
4    DLR       DIGITAL REALTY TRUST REIT INC  Real Estate      Equity  ...  253868103  US2538681030  B03GQS4             -
5      O             REALTY INCOME REIT CORP  Real Estate      Equity  ...  756109104  US7561091049  2724193             -
6   WELL                       WELLTOWER INC  Real Estate      Equity  ...  95040Q104  US95040Q1040  BYVYHH4             -
7    AVB      AVALONBAY COMMUNITIES REIT INC  Real Estate      Equity  ...  053484101  US0534841012  2131179             -
8    ARE  ALEXANDRIA REAL ESTATE EQUITIES RE  Real Estate      Equity  ...  015271109  US0152711091  2009210             -
9    EQR             EQUITY RESIDENTIAL REIT  Real Estate      Equity  ...  29476L107  US29476L1070  2319157             -

[10 rows x 12 columns]]
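
WebDriverWait in the code above works by polling: it calls the given condition repeatedly (every 0.5 seconds by default) until the condition returns a truthy value or the timeout expires. That is why it succeeds where reading page_source immediately after get() does not. The following is a minimal, Selenium-free sketch of that idea, not Selenium's actual implementation; wait_until, WaitTimeout and table_is_rendered are hypothetical names used only for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate a table that only "appears" on the third check, the way a
# JavaScript-rendered table appears some time after the page has loaded.
attempts = {"count": 0}


def table_is_rendered():
    attempts["count"] += 1
    return "<table>...</table>" if attempts["count"] >= 3 else None


html = wait_until(table_is_rendered, timeout=5.0, poll_interval=0.01)
print(attempts["count"], html)
```

In the real code the condition is EC.visibility_of_element_located(...), and the value returned by until() is the located element.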
undetected Selenium
  • Thank you so much DebanjanB. I am not fluent in coding, so I find it quite difficult. – Mango Dec 23 '21 at 07:58
  • I tried the code, but it says "NameError: name 'driver' is not defined", so I added "driver = webdriver.Chrome()" and now it says "WebDriverException: chrome not reachable (Session info: chrome=96.0.4664.110)" – Mango Dec 23 '21 at 08:02
  • @Mango Check out the updated answer and let me know the status. – undetected Selenium Dec 23 '21 at 09:05
  • I still get an error message saying "NameError: name 'wd' is not defined". Maybe I have to declare 'wd' first or install something on my PC? I am sorry, but I am not a coder or developer, and my coding knowledge is still quite limited =( – Mango Dec 24 '21 at 04:45
  • Thank you so much @undetected Selenium, I finally got it to work by using WebDriverWait so that the data is fully loaded. However, the output is truncated from 400+ rows down to only 10 rows. What should I do to load all 400+ rows? Your help will be very much appreciated – Mango Jan 13 '22 at 15:25
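
Regarding the last comment: the 10 rows are most likely a single page of a paginated table, so only those rows exist in the HTML that was captured. One possible approach is to capture the table's outerHTML once per page, parse each page, and concatenate the results. Moving between pages with Selenium is not shown here, because the locator of the paging control is not known from this thread. A minimal sketch of the combining step follows; combine_pages is a hypothetical helper, and the two HTML snippets are hand-written stand-ins built from rows in the console output above.

```python
from io import StringIO

import pandas as pd


def combine_pages(html_pages):
    """Parse one table per HTML snippet and stack the rows into one DataFrame."""
    # read_html returns a list of DataFrames; each snippet holds one table.
    frames = [pd.read_html(StringIO(html))[0] for html in html_pages]
    return pd.concat(frames, ignore_index=True)


# Stand-ins for the outerHTML captured from two consecutive table pages.
page_1 = """
<table>
  <thead><tr><th>Ticker</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>PLD</td><td>PROLOGIS REIT INC</td></tr>
    <tr><td>EQIX</td><td>EQUINIX REIT INC</td></tr>
  </tbody>
</table>
"""
page_2 = """
<table>
  <thead><tr><th>Ticker</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>PSA</td><td>PUBLIC STORAGE REIT</td></tr>
  </tbody>
</table>
"""

holdings = combine_pages([page_1, page_2])
print(holdings)
```

With the real page, each element of html_pages would be the get_attribute("outerHTML") value collected after the table has re-rendered for that page.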