
I would like to extract, for example, all the values that are within the "Holdings" from https://www.morningstar.com/funds/xnas/aepfx/portfolio. Some of these values are:

  • Current Portfolio Date = Mar 31, 2022
  • Equity Holdings = 384

I tried a few different approaches, but none of them seem to work.

1st) I tried:

soup.find_all("div", class_="sal-dp-value")

But this returns an empty list.

What's odd is that I can't even find

<div class="sal-dp-value">Mar 31, 2022</div>

when searching on the raw data printed by:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.morningstar.com/funds/xnas/aepfx/portfolio')
soup = BeautifulSoup(r.text, "html.parser")
soup.html

2nd) Not ideal, as I prefer to use BeautifulSoup, but I also tried XPath:

import requests
from lxml import html

page = requests.get("https://www.morningstar.com/funds/xnas/aepfx/portfolio").text
holdings = html.fromstring(page).xpath('/html/body/div[2]/div/div/div[2]/div[3]/div/main/div[2]/div/div/div[1]/sal-components/section/div/div/div[3]/sal-components-mip-holdings/div/div/div/div[2]/div[1]/ul/li[1]/div/div[2]')
holdings

This also returns an empty list.
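A quick sanity check (a sketch; the HTML string below is a stand-in for the real `r.text`) is to search the raw response for the class name before parsing at all. If the marker never appears in the server's HTML, no parser can find it, because the element only exists after JavaScript runs in the browser:

```python
# Stand-in for the raw response text (r.text); the real page ships an
# empty app shell plus JS bundles that build the DOM client-side.
html_text = "<html><body><div id='app'></div><script src='bundle.js'></script></body></html>"

# If the class is absent from the raw HTML, requests/bs4/lxml cannot
# see it -- the content is rendered by JavaScript, not by the server.
print("sal-dp-value" in html_text)  # False
```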

  • The site relies heavily on JS and that's how the content is created, so bs4 won't see a thing of it. Also, scraping morningstar is against their ToS. However, you might want to explore [their API](https://developer.morningstar.com/). – baduker Apr 28 '22 at 13:11
  • That makes much more sense. Will take a look at their API, thanks. – FFLS Apr 28 '22 at 13:18

1 Answer


As that site is JavaScript-heavy, bs4 and lxml can't see the content. Instead, try the following approach, which fetches the required fields from the site's backend API:

import requests

link = 'https://api-global.morningstar.com/sal-service/v1/fund/portfolio/holding/v2/FOUSA06WRH/data'

headers = {
    'apikey': 'lstzFDEOhfFNMLikKa0am9mgEKLBl49T',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
}

payload = {
    'premiumNum': '100',
    'freeNum': '25',
    'languageId': 'en',
    'locale': 'en',
    'clientId': 'MDC',
    'benchmarkId': 'mstarorcat',
    'component': 'sal-components-mip-holdings',
    'version': '3.59.1'
}

with requests.Session() as s:
    s.headers.update(headers)
    resp = s.get(link, params=payload)
    container = resp.json()
    portfolio_date = container['holdingSummary']['portfolioDate']
    equity_holding = container['numberOfEquityHolding']
    active_share = container['holdingActiveShare']['activeShareValue']
    reported_turnover = container['holdingSummary']['lastTurnover']
    other_holding = container['holdingSummary']['numberOfOtherHolding']
    top_holding = container['holdingSummary']['topHoldingWeighting']
    print(portfolio_date, equity_holding, active_share, reported_turnover, other_holding, top_holding)
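Note that the chained lookups above will raise `KeyError` if the response shape ever changes. A hedged variant that degrades gracefully (a sketch; the `sample` dict below is illustrative only, not real API output):

```python
def dig(d, *keys, default=None):
    """Walk nested dict keys, returning `default` if any level is missing."""
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return default
        d = d[key]
    return d

# Illustrative sample shaped like the fields used above (not real data).
sample = {'holdingSummary': {'portfolioDate': '2022-03-31', 'lastTurnover': 33.0}}

print(dig(sample, 'holdingSummary', 'portfolioDate'))                       # 2022-03-31
print(dig(sample, 'holdingSummary', 'topHoldingWeighting', default='n/a'))  # n/a
```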
  • This works amazingly well. Question - is this apikey some sort of public one? I wasn't able to find it anywhere. – FFLS Apr 29 '22 at 16:04
  • Yeah, the apikey that I used within the script is public. I found it using chrome dev tools. – robots.txt Apr 30 '22 at 15:10