Scraping data from webpage with data download delay

Question

I have looked at a couple of questions on this site about this problem, but I can't get their solutions working. I am using Python and Selenium with a headless Chrome browser to scrape bond data from Vanguard. Vanguard loads the data into the page after a delay, and I can't figure out how to retrieve it properly.

I am trying to load data from this webpage, specifically the data from the fund facts table.

When I tried doing this as I typically do I get

<iframe data-delayed-src="https://fls.doubleclick.net/activityi;src=844392;u7=vgmf;type=remar743;cat=mutua911;u1=prd;ord=1632433243910?" id="floodIframe" src="https://fls.doubleclick.net/activityi;src=844392;u7=vgmf;type=remar743;cat=mutua911;u1=prd;ord=1632433243910?"></iframe>

So I tried using this line of code to get the browser to wait until the data is loaded.

WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "data-ng-class")))

I think this is on the right track, but I don't know how to tell which element I should be waiting for, or whether I am doing it correctly. Is there a way to wait until the iframe data-delayed-src element goes away before getting the data?

I have seen it used with By.ID, but none of the elements in the HTML I want have an id.

Here is the code I am using

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import os

dirname = os.path.dirname(__file__)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options, executable_path=os.path.join(dirname, 'chromedriver'))
symbol = 'vbirx'
url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
browser.get(url_vanguard.format(symbol))
# WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "data-ng-class")))

html = browser.page_source
mySoup = BeautifulSoup(html, 'html.parser')
htmlData = mySoup.find('table',{'role':'presentation'})
table = htmlData.find('tbody')
print('table: \n',table)

The table prints without any of the data I want, like this:

 <tbody>
<!-- ngRepeat: item in genericTableData.items -->
</tbody>

sounds like what you want is expected condition of frameToBeAvailableAndSwitchToIt: https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/support/ui/ExpectedConditions.html#frameToBeAvailableAndSwitchToIt(int) Note that the driver will then be in that iframe. So you need to switch back out of it to act on the parent frame/page. Also Page Source may not contain what you need. You may need to target specific webelements. — pcalkins, Sep 23 '21 at 22:14
@pcalkins would it be something like [this](https://stackoverflow.com/questions/14515120/how-do-i-wait-for-a-specific-frame-to-load-im-using-selenium-webdriver-2-24)? With the frame name being 'iframe data-delayed-src' — Ben Cole, Sep 23 '21 at 22:27
yes, but use a locator... I would target the ID attribute: id="floodIframe" So you could still use your By... By.XPATH, "//iframe[@id='floodIframe']" Something like that... — pcalkins, Sep 23 '21 at 22:43

Kamalesh S · Accepted Answer · 2021-09-24T16:03:13.283

I used the XPath of the Fund facts table in the WebDriverWait statement to get it working.

Code snippet:

symbol = 'vbirx'
url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
browser.get(url_vanguard.format(symbol))

# Wait for the fund facts table to load
WebDriverWait(browser, 15).until(EC.presence_of_element_located((By.XPATH, '//*[@class="summary-table historical-table col2Wide"]')))

html = browser.page_source
mySoup = BeautifulSoup(html, 'html.parser')
htmlData = mySoup.find('table', {'role': 'presentation'})
table = htmlData.find('tbody')
cells = table.find_all('td')
for cell in cells:
    span = cell.find('span')
    # Skip cells without a <span>; calling .text on None would raise AttributeError
    if span is not None:
        print(span.text)
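
If the table element is present before its cells are filled in, waiting on presence alone can return too early. The following is a minimal sketch of a custom wait condition that polls until the cells actually contain text. The helper names `cells_have_text` and `wait_for_fund_facts` are hypothetical, and the CSS selector is an assumption inferred from the `find('table', {'role': 'presentation'})` lookup above; it has not been checked against the live page.

```python
def cells_have_text(locator, min_cells=2):
    """Build a wait condition that passes once enough matched cells hold text.

    `locator` is a (by, value) pair, as used with expected_conditions.
    """
    def condition(driver):
        cells = driver.find_elements(*locator)
        texts = [cell.text.strip() for cell in cells]
        filled = [text for text in texts if text]
        # WebDriverWait keeps polling while the condition returns a falsy value.
        return filled if len(filled) >= min_cells else False
    return condition


def wait_for_fund_facts(browser, timeout=15):
    """Block until the table cells are populated, then return their texts."""
    # Imported here so the condition above can be exercised without Selenium.
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumed selector, derived from find('table', {'role': 'presentation'}).
    locator = (By.CSS_SELECTOR, "table[role='presentation'] tbody td")
    wait = WebDriverWait(
        browser,
        timeout,
        # The page may replace cells mid-poll; retry instead of failing.
        ignored_exceptions=(StaleElementReferenceException,),
    )
    return wait.until(cells_have_text(locator))
```

Calling `wait_for_fund_facts(browser)` in place of the presence wait would raise `TimeoutException` if the cells never fill in, which at least distinguishes "data never arrived" from "page source was read too soon".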

edited Sep 24 '21 at 16:03

answered Sep 24 '21 at 05:39


This got me the labels of the table but the values are still missing. I've tried some other elements on the page but can't find one that loads late enough to give me the values – Ben Cole Sep 24 '21 at 09:33
I am able to get both the labels and the values as well. I will update the code for parsing in the answer. – Kamalesh S Sep 24 '21 at 16:00
