3

So I'm new to the world of web scraping, and so far I've only really been using BeautifulSoup to scrape text and images off websites. I thought I'd try to scrape some data points off a graph to test my understanding, but I got a bit confused by this graph.

After inspecting the element of the piece of data I wanted to extract, I saw this: <span id="TSMAIN">: 100.7490637</span>. The problem is, my original idea for scraping the data points was to iterate through some sort of id list containing all the different data points (if that makes sense?).

Instead, it seems that all the data points are contained within this same element, and the value depends on where your cursor is on the graph.

My problem is, if I use BeautifulSoup's find function and look for that specific element with the attribute id="TSMAIN", I get a None return, because I am guessing that unless I have my cursor on the actual graph, nothing will show up there.

Code:

from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"}
url = "https://www.morningstar.co.uk/uk/funds/snapshot/snapshot.aspx?id=F0GBR050AQ&tab=13"

source = requests.get(url, headers=headers)
soup = BeautifulSoup(source.content, 'lxml')
data = soup.find("span", attrs={"id": "TSMAIN"})
print(data)

Output

None

How can I extract all the data points of this graph?

Vishal Jain

1 Answer

5

Seems like the data can be pulled from an API. The only thing is, the values it returns are relative to the start date entered in the payload: the output for the start date is set to 0, and the numbers after that are relative to that date.

import requests
import pandas as pd
from datetime import datetime
from dateutil import relativedelta

userInput = input('Choose:\n\t1. 3 Month\n\t2. 6 Month\n\t3. 1 Year\n\t4. 3 Year\n\t5. 5 Year\n\t6. 10 Year\n\n -->: ')
userDict = {'1': 3, '2': 6, '3': 12, '4': 36, '5': 60, '6': 120}

# Start date: yesterday, minus the chosen number of months
n = datetime.now()
n = n - relativedelta.relativedelta(days=1)
n = n - relativedelta.relativedelta(months=userDict[userInput])
dateStr = n.strftime('%Y-%m-%d')


url = 'https://tools.morningstar.co.uk/api/rest.svc/timeseries_cumulativereturn/t92wz0sj7c'

data = []
# Display name -> id used by the API (found via the browser dev tools, see below).
# The empty key is the fourth series on the chart, whose name wasn't captured.
idDict = {
        'Schroder Managed Balanced Instl Acc': 'F0GBR050AQ]2]0]FOGBR$$ALL',
        'GBP Moderately Adventurous Allocation': 'EUCA000916]8]0]CAALL$$ALL',
        'Mixed Investment 40-85% Shares': 'LC00000012]8]0]CAALL$$ALL',
        '': 'F00000ZOR1]7]0]IXALL$$ALL'}


for k, v in idDict.items():
    payload = {
        'encyId': 'GBP',
        'idtype': 'Morningstar',
        'frequency': 'daily',
        'startDate': dateStr,
        'performanceType': '',
        'outputType': 'COMPACTJSON',
        'id': v,
        'decPlaces': '8',
        'applyTrackRecordExtension': 'false'}

    # COMPACTJSON comes back as a list of [timestamp_ms, value] pairs
    temp_data = requests.get(url, params=payload).json()
    df = pd.DataFrame(temp_data)
    df['timestamp'] = pd.to_datetime(df[0], unit='ms')
    df['date'] = df['timestamp'].dt.date
    df = df[['date', 1]]
    df.columns = ['date', k]
    data.append(df)

# Inner-join the series on date so only dates shared by all remain
final_df = pd.concat(
    (iDF.set_index('date') for iDF in data),
    axis=1, join='inner'
).reset_index()

final_df.plot(x="date", y=list(idDict.keys()), kind="line")

Output:

print (final_df.head(5).to_string())
         date  Schroder Managed Balanced Instl Acc  GBP Moderately Adventurous Allocation  Mixed Investment 40-85% Shares          
0  2019-12-22                             0.000000                               0.000000                        0.000000  0.000000
1  2019-12-23                             0.357143                               0.406784                        0.431372  0.694508
2  2019-12-24                             0.714286                               0.616217                        0.632422  0.667586
3  2019-12-25                             0.714286                               0.616217                        0.632422  0.655917
4  2019-12-26                             0.714286                               0.612474                        0.629152  0.664124
....
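Since the values are cumulative percentage returns relative to the start date (0.0 on that date), you can rebase them to a growth-of-100 index if that's easier to read. A minimal sketch with made-up column data mimicking the output above:

```python
import pandas as pd

# Hypothetical frame mimicking the API output: cumulative percentage
# returns, 0.0 on the chosen start date (the column name is made up).
df = pd.DataFrame({
    "date": ["2019-12-22", "2019-12-23", "2019-12-24"],
    "Fund": [0.000000, 0.357143, 0.714286],
})

# Rebase to a growth-of-100 index: 100 on the start date,
# then 100 * (1 + cumulative_return / 100) afterwards.
df["Fund index"] = 100 * (1 + df["Fund"] / 100)
print(df)
```

The same expression works column-by-column on `final_df` for each fund.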

To get those ids, it took a little investigating of the requests in the browser's dev tools. Searching through those, I was able to find the corresponding id values, and with a little bit of trial and error worked out which values meant what.
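One pattern you can verify from the question itself: the first ']'-separated segment of each id matches the plain security id in the snapshot page URL (the question's URL uses id=F0GBR050AQ). The meanings of the remaining segments are guesswork found by trial and error, as described above, but a small sketch of that first segment:

```python
# The API ids aren't documented anywhere I know of, but the first
# ']'-separated token appears to be the plain Morningstar security id:
# the question's snapshot URL uses id=F0GBR050AQ, which matches the
# first token of the id used in the payload above.
api_id = 'F0GBR050AQ]2]0]FOGBR$$ALL'
security_id = api_id.split(']')[0]
print(security_id)  # F0GBR050AQ
```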

[screenshot: browser dev tools showing the API requests and the id values]

Those "alternate" ids used. And where those line graphs get the data from (inthose 4 request, look at the Preview pane, and you'll see the data in there.

[screenshot: the 4 requests whose Preview pane contains the series data]

Here's the final output/graph:

[screenshot: line chart plotting the four series]

chitown88
  • Thanks for the solution, but does that mean it's impossible to use web scraping to get this sort of data? I would've thought maybe the answer was another package like Scrapy or maybe Selenium? I've heard these names before when it comes to web scraping. Just want to check what to do in the case where APIs don't exist as a solution – Vishal Jain Jun 23 '20 at 09:10
  • 1
    For dynamic sites where the data is rendered by JavaScript, you're correct, using a simple request won't work. Selenium could certainly work here, and for those sites with no API, as it allows the HTML to be rendered before pulling the page source. APIs, though, are nice in that their structure is consistent and you usually get even more data than what you see on the site. Also SOMETIMES, you will see the data in JSON format in the ` – chitown88 Jun 23 '20 at 09:42
  • Where did you find the documentation for this API? I can't seem to find it anywhere on the web. I would like to scrape some other graphs off Morningstar. – Vishal Jain Jul 13 '20 at 11:51
  • I couldn't find the documentation either. I simply had to look at the dev tools to sort of see what values/parameters to use as I clicked around the site. – chitown88 Jul 13 '20 at 12:26
  • Do you have a link to what page you saw? I'm not sure how you discovered the arguments: 'Schroder Managed Balanced Instl Acc':'F0GBR050AQ]2]0]FOGBR$$ALL', 'GBP Moderately Adventurous Allocation':'EUCA000916]8]0]CAALL$$ALL', 'Mixed Investment 40-85% Shares':'LC00000012]8]0]CAALL$$ALL', '':'F00000ZOR1]7]0]IXALL$$ALL'} I get that the keys are just the names of the lines, but where did you find these values? – Vishal Jain Jul 13 '20 at 12:27
  • I added a little more info in the solutions. It's not as straightforward. I had to do a little bit of trial and error to find it, but those values do come back in a separate response. – chitown88 Jul 13 '20 at 14:31
  • Thanks for the edit, I managed to find the IDs as you've mentioned in your answer, but the issue is this: you've got the ID for, let's say, GBP Moderately Adventurous Allocation as EUCA000916]8]0]CAALL$$ALL. Inspecting the HTML, I can only find EUCA00091680 under ID; where did you discover the CAALL$$ALL part at the end? Sorry for the extended discussion, but it would help me out a lot to be able to reuse this. – Vishal Jain Jul 14 '20 at 10:54
  • 1
    Ahhh, I see what you mean. Again, ya, just sort of had to search around for it (I'll add another photo... there are 4 requests made that had that in there). Like I said before, this took me a lot of trial and error and just playing around. This site isn't exactly a straightforward one to scrape; nonetheless, still a good way to learn how to do all this. – chitown88 Jul 15 '20 at 07:36
  • @chitown88 Really love this. Could you ELI5 a little bit more? I'm a scraping n00b as well. – Alexander Dec 14 '20 at 06:24
  • @chitown88 Specifically 1. How did you get the url? The url you used is different from the one in the question. 2. In the id-dict, how did you know which of the values corresponded to each line? 3. How did you know the 4 "t92wz0sj..." requests corresponded to each of the 4 line graphs (other than there are 4 of them?) 4. How did you know to look the at "XHR" tab? What do the other Network tabs mean? – Alexander Dec 14 '20 at 06:30
  • @chitown88 How would you go about scraping this table? https://www.morningstar.com/stocks/xnas/goog/performance – Alexander Dec 14 '20 at 07:48
  • @Alexander I’ll take a look tomorrow – chitown88 Dec 14 '20 at 09:03
  • @Alexander actually, email me (with the questions too). Might be easier to take this conversation away from the comments. Jason.schvach@gmail.com – chitown88 Dec 14 '20 at 09:06