Using Pyppeteer to download CSV / Excel file from Vanguard via JavaScript

Question

I'm trying to automate downloading the holdings of Vanguard funds from the web. The links resolve through JavaScript so I'm using Pyppeteer but I'm not getting the file. Note, the link says CSV but it provides an Excel file.

From my browser it works like this:

Go to the fund URL, eg https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio
Follow the link, "See count total holdings"
Click the link, "Export to CSV"

My attempt to replicate this in Python follows. The first link follow seems to work because I get different HTML but the second click gives me the same page, not a download.

import asyncio
from pyppeteer import launch
import os

async def get_page(browser, url):
    page = await browser.newPage()
    await page.goto(url)
    return page

async def fetch(url):
    browser = await launch(options={'args': ['--no-sandbox']})  #headless=True, 
    page = await get_page(browser, url)     
    await page.waitFor(2000)
    
    # save the page so we can see the source
    wkg_dir = 'vanguard'
    t_file = os.path.join(wkg_dir,'8225.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())
    
    accept = await page.xpath('//a[contains(., "See count total holdings")]')
    print(f'Found {len(accept)} "See count total holdings" links')
    
    if accept:
        await accept[0].click()
        await page.waitFor(2000)
    else:
        print('DID NOT FIND THE LINK')
        return False
    
    # save the pop-up page for debug
    t_file = os.path.join(wkg_dir,'first_page.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())
    
    links = await page.xpath('//a[contains(., "Export to CSV")]')
    print(f'Found {len(links)} "Export to CSV" links')    # 3 of these
    for i, link in enumerate(links):
        print(f'Trying link {i}')
        await link.click()
        await page.waitFor(2000)
        
        t_file = os.path.join(wkg_dir,f'csv_page{i}.csv')
        with open(t_file, 'w', encoding="utf-8") as ef:
            ef.write(await page.content())

    return True

#---------- Main ------------
# Set constants and global variables
url = 'https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio'
loop = asyncio.get_event_loop()
status = loop.run_until_complete(fetch(url))

Would love to hear suggestions from anyone that knows Puppeteer / Pyppeteer well.

Vanguard has changed their web page and this code no longer applies — InnocentBystander, Jan 05 '23 at 23:29

ggorlen · Accepted Answer · 2023-05-17T16:21:30.160

First of all, page.waitFor(2000) should be the last resort. That's a race condition that can lead to a false negative at worst and slows your scrape down at best. I recommend page.waitForXPath which spawns a tight polling loop to continue your code as soon as the xpath becomes available.

Also on the topic of element selection, I'd use text() in your xpath instead of . which is more precise.

I'm not sure how ef.write(await page.content()) is working for you -- that should only give page HTML, not the XLSX download. The link click triggers downloads via a dialog. Accepting this download involves enabling Chrome downloads with

await page._client.send("Page.setDownloadBehavior", {
    "behavior": "allow", 
    "downloadPath": r"C:\Users\you\Desktop" # TODO set your path
})

The next hurdle is bypassing or suppressing the "multiple downloads" permission prompt Chrome displays when you try to download multiple files on the same page. I wasn't able to figure out how to stop this, so my code just navigates to the page for each link as a workaround. I'll leave my solution as sub-optimal but functional and let others (or my future self) improve on it.

By the way, two of the XLSX files at indices 1 and 2 seem to be identical. This code downloads all 3 anyway, but you can probably skip the last depending on whether the page changes or not over time -- I'm not familiar with it.

I'm using a trick for clicking non-visible elements, using the browser console's click rather than Puppeteer: page.evaluate("e => e.click()", csv_link)

Here's my attempt:

import asyncio
from pyppeteer import launch


async def get_csv_links(page):
    await page.goto(url)

    xp = '//a[contains(text(), "See count total holdings")]'
    await page.waitForXPath(xp)
    accept, = await page.xpath(xp)
    await accept.click()

    xp = '//a[contains(text(), "Export to CSV")]'
    await page.waitForXPath(xp)
    return await page.xpath(xp)


async def fetch(url):
    browser = await launch(headless=False)
    page, = await browser.pages()
    await page._client.send("Page.setDownloadBehavior", {
        "behavior": "allow", 
        "downloadPath": r"C:\Users\you\Desktop" # TODO set your path
    })
    csv_links = await get_csv_links(page)

    for i in range(len(csv_links)):
        # open a fresh page each time as a hack to avoid multiple file prompts
        csv_link = (await get_csv_links(page))[i]
        await page.evaluate("e => e.click()", csv_link)

        # let download finish; this is a race condition
        await page.waitFor(3000)


if __name__ == "__main__":
    url = "https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio"
    asyncio.run(fetch(url))

Notes for improvement:

Try an arg like --enable-parallel-downloading or a setting like 'profile.default_content_setting_values.automatic_downloads': 1 to suppress the "multiple downloads" warning.
Figure out how to wait for all downloads to complete so the final waitFor(3000) can be removed. Another option here might involve polling for the files you expect; you can visit the linked thread for ideas.
Figure out how to download headlessly.

Using Pyppeteer to download CSV / Excel file from Vanguard via JavaScript

1 Answers1