
I want to extract data from this site: https://www.vanguardinvestor.co.uk/investments/vanguard-s-and-p-500-ucits-etf-usd-distributing/distributions

However, I do not get any results. I found that every row starts with something like this:

<tr ng-if="portSpecific.data.distributionHistory.domicile !== 'GB'" data-ng-repeat="fundDistribution in distributionHistoryList | limitTo:10  " data-ng-include="'${app-content-context}partials/includes/detail/distribution-rows.html' | configReplace | vuiCacheBuster" class="" style="">    <td class="vuiFixedCol fundDistributionType">Income Distribution</td>
    <td class="alignRgt mostRecent"><span data-ng-bind-html="fundDistribution.mostRecent.currencySymbol">$</span>0.250768
    </td>
    <!----><td class="exDividendDate" data-ng-if="fund.data.assetClass !== 'Money Market'">24 Sep 2020</td><!---->
    <td class="recordDate">25 Sep 2020</td>
    <td class="payableDate">07 Oct 2020</td></tr>

When I search for <tr> elements I do not get any results. What am I missing?

import requests
from bs4 import BeautifulSoup

url = 'https://www.vanguardinvestor.co.uk/investments/vanguard-s-and-p-500-ucits-etf-usd-distributing/distributions'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect the cell text of every table row found in the downloaded HTML
rows = []
for tr in soup.find_all('tr'):
    values = [td.text for td in tr.find_all('td')]
    rows.append(values)
print(rows)

1 Answer


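The page content is loaded dynamically with JavaScript (the ng-* attributes in your snippet are AngularJS directives), so the HTML that requests downloads does not contain the table rows. requests only fetches the raw response; it does not execute JavaScript.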
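One way to check this for any page is to count the <tr> tags in the raw HTML before parsing further. The following is a minimal sketch using only the standard library. The `count_tags` helper and the `unrendered` string are hypothetical: the string is only a stand-in for what an unrendered AngularJS page might look like, not the actual Vanguard response.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given start tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of <tag> start tags found in html."""
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical stand-in for the HTML a server sends before JavaScript runs.
unrendered = '<div data-ng-view></div><script src="app.js"></script>'
# A row as the browser shows it after rendering (shortened from the question).
rendered = '<table><tr><td class="recordDate">25 Sep 2020</td></tr></table>'

print(count_tags(unrendered, "tr"))
print(count_tags(rendered, "tr"))
```

With a real page you would pass `requests.get(url).text` to `count_tags`. If the count is zero while the browser shows rows, the rows are being added client-side.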

However, we can get the data by sending a GET request to the website's API.

import re
import json
import requests
from bs4 import BeautifulSoup

URL = "https://api.vanguard.com/rs/gre/gra/1.7.0/datasets/urd-product-port-specific.jsonp?vars=portId:9503,issueType:F&callback=angular.callbacks._4"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

fmt_string = "{:<25} {:<20} {:<20} {:<20} {:<20}"
print(
    fmt_string.format(
        "Distribution type",
        "Most recent",
        "Ex-dividend date",
        "Record date",
        "Payable date",
    )
)
print("-" * 105)

# The response is JSONP: take the JSON object between the outermost curly braces
json_data = json.loads(re.search(r"({.*})", str(soup)).group(1))

for entry in json_data["distributionHistory"]["fundDistributionList"]:
    distribution = entry["type"]
    most_recent = entry["mostRecent"]["value"]
    ex_dividend_date = entry["exDividendDate"]
    record_date = entry["recordDate"]
    payable_date = entry["payableDate"]

    print(
        fmt_string.format(
            distribution, most_recent, ex_dividend_date, record_date, payable_date
        )
    )

Output:

Distribution type         Most recent          Ex-dividend date     Record date          Payable date
---------------------------------------------------------------------------------------------------------
Income Distribution       0.250768             24 Sep 2020          25 Sep 2020          07 Oct 2020
Income Distribution       0.195290             11 Jun 2020          12 Jun 2020          24 Jun 2020
Income Distribution       0.289243             26 Mar 2020          27 Mar 2020          08 Apr 2020
Income Distribution       0.202612             12 Dec 2019          13 Dec 2019          27 Dec 2019

... and so on
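If you want to work with the values rather than just print them, it may help to convert each entry into typed fields. The sketch below assumes the entries have the same keys used in the loop above. The `sample` list is hand-written from the output shown, and `to_record` is a hypothetical helper. `float()` is used so it works whether the API returns the amount as a string or a number, and `%b` assumes English month abbreviations.

```python
from datetime import datetime

DATE_FORMAT = "%d %b %Y"  # e.g. "24 Sep 2020"; %b assumes English month names

# Hand-written sample shaped like the "fundDistributionList" entries;
# the values are copied from the output shown in the answer.
sample = [
    {
        "type": "Income Distribution",
        "mostRecent": {"value": "0.250768"},
        "exDividendDate": "24 Sep 2020",
        "recordDate": "25 Sep 2020",
        "payableDate": "07 Oct 2020",
    },
    {
        "type": "Income Distribution",
        "mostRecent": {"value": "0.195290"},
        "exDividendDate": "11 Jun 2020",
        "recordDate": "12 Jun 2020",
        "payableDate": "24 Jun 2020",
    },
]


def parse_date(text):
    """Convert a string such as '24 Sep 2020' into a datetime.date."""
    return datetime.strptime(text, DATE_FORMAT).date()


def to_record(entry):
    """Turn one raw distribution entry into a dict with typed values."""
    return {
        "type": entry["type"],
        "amount": float(entry["mostRecent"]["value"]),
        "ex_dividend_date": parse_date(entry["exDividendDate"]),
        "record_date": parse_date(entry["recordDate"]),
        "payable_date": parse_date(entry["payableDate"]),
    }


records = [to_record(entry) for entry in sample]
for record in records:
    print(record["payable_date"].isoformat(), record["amount"])
```

With the real response, you would pass `json_data["distributionHistory"]["fundDistributionList"]` instead of `sample`.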
  • Thanks for the clear answer! Sorry for the late response. I have two questions: Where did you find the Vanguard S&P 500 JSON API? And would you mind explaining how this part works: json.loads(re.search(r"({.*})", str(soup)).group(1))? Thanks in advance. – Joey Schuitemaker Dec 20 '20 at 17:58
  • @JoeySchuitemaker 1. Open DevTools in your browser (in Chrome: right click -> Inspect -> Network). There you can see all the requests the page makes. 2. The data we want is within curly braces `{}`, so to find that data we use `re.search(r"({.*})..`. – MendelG Dec 20 '20 at 18:04
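To make the regex step concrete: the API returns JSONP, meaning a JSON object wrapped in a JavaScript function call. The sketch below works on a short hand-written string shaped like that response, not on real API output. `parse_jsonp` is a hypothetical helper, and `re.DOTALL` is an addition so the match also works if the JSON spans several lines.

```python
import json
import re

# Hand-written JSONP body: callback_name({...}). Only the shape matches the API.
jsonp = (
    'angular.callbacks._4({"distributionHistory": '
    '{"fundDistributionList": [{"type": "Income Distribution"}]}})'
)


def parse_jsonp(text):
    """Extract and parse the JSON object inside a JSONP response body."""
    # ".*" is greedy, so the match runs from the first "{" to the last "}",
    # which covers the whole JSON object and drops the callback wrapper.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))


data = parse_jsonp(jsonp)
print(data["distributionHistory"]["fundDistributionList"][0]["type"])
```

Because the function takes plain text, it could be applied to `requests.get(URL).text` directly, without going through BeautifulSoup.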