Extract data from embedded script tag in html

Question

I'm trying to fetch data from inside a (big) script tag within HTML. Using BeautifulSoup I can get to the right script tag, but I cannot extract the data I want.

What I'm looking for inside this tag is a list called "Beleidsdekkingsgraad", specifically ["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"]. More specifically still, I want the last entry in the list (116,2).

Following 1 or 2 did not get the job done.

What I've done so far

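import requests
from bs4 import BeautifulSoup

base = 'https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url = requests.get(base)
soup = BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
# Hard-coded script index and character offsets: this is the fragile part
all_scripts[3].get_text()[1907:2179]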

This, however, is not satisfactory, since the indices have to be changed each time new numbers are added.

I'm looking for an easy way to, first, extract the list from the script tag and, second, get the last number of the extracted list (i.e. 116,2).

QHarr · Accepted Answer · 2019-09-04T21:24:05.100

You could regex out the JavaScript object holding that item, then parse it with the json library:

import requests,re,json

r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
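
To try this first approach without hitting the network, here is a minimal sketch that runs against a made-up, heavily shortened page. SAMPLE_HTML and the helper names extract_infographic_data and last_value are assumptions for illustration only; the real page is much larger and its structure may differ or change over time.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the real page source.
SAMPLE_HTML = (
    '<html><body><script>window.infographicData={"elements":[{"type":"text"},'
    '{"data":[[["Datum","jan","feb","mrt"],'
    '["Beleidsdekkingsgraad","107,6","109,1","116,2"]]]}]};</script></body></html>'
)


def extract_infographic_data(html):
    """Return the parsed window.infographicData object, or None if it is absent."""
    match = re.search(r'window\.infographicData=(.*);', html)
    if match is None:
        return None
    return json.loads(match.group(1))


def last_value(data, label):
    """Return the last entry of the first row in elements[1] that starts with label."""
    rows = data['elements'][1]['data'][0]
    for row in rows:
        if row and row[0] == label:
            return row[-1]
    return None


data = extract_infographic_data(SAMPLE_HTML)
result = last_value(data, 'Beleidsdekkingsgraad')
print(result)
```

Checking for a missing match before calling json.loads avoids an IndexError or TypeError if the page layout changes. The elements[1] index is still taken from the answer above and remains an assumption about the page.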

Or do the whole thing with regex:

import requests,re

r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])
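
The regex-only approach can be checked offline in the same way. In the sketch below, SAMPLE_TEXT is an invented excerpt, and last_entry and to_float are hypothetical helpers. The conversion step assumes the value uses a decimal comma and no thousands separator, as in "116,2".

```python
import re

# Hypothetical excerpt of the script text; the real page is much longer.
SAMPLE_TEXT = (
    '[["Datum","jan","feb","mrt"],'
    '["Beleidsdekkingsgraad","107,6","109,1","116,2"],'
    '["Andere reeks","99,1","98,7"]]'
)

PATTERN = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')


def last_entry(text):
    """Return the last quoted number of the Beleidsdekkingsgraad list, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


def to_float(number_text):
    """Convert a decimal-comma string such as '116,2' to a float."""
    return float(number_text.replace(',', '.'))


raw = last_entry(SAMPLE_TEXT)
value = to_float(raw)
print(raw, value)
```

Using search instead of findall(...)[0] returns None rather than raising an IndexError when the label is not present.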

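Second regex: \["Beleidsdekkingsgraad" matches the literal start of the list, .+? lazily skips over the entries in between, and ,"([0-9,]+)"\] captures the final quoted number just before the closing bracket.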

Another option:

import requests,re, json

r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])
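
A slightly more general version of this last option can take the label as a parameter. This is only a sketch: SAMPLE_TEXT is an invented excerpt and extract_row is a hypothetical helper. It assumes that no value inside the list contains the sequence "] because the lazy match stops at the first one.

```python
import json
import re

# Hypothetical excerpt of the script text; the real page is much longer.
SAMPLE_TEXT = (
    '[["Datum","jan","feb","mrt"],'
    '["Beleidsdekkingsgraad","107,6","109,1","116,2"],'
    '["Andere reeks","99,1","98,7"]]'
)


def extract_row(text, label):
    """Return the JSON list that starts with label as a Python list, or None."""
    # re.escape keeps any regex metacharacters in the label literal.
    pattern = re.compile(r'(\["' + re.escape(label) + r'".+?"\])')
    match = pattern.search(text)
    if match is None:
        return None
    return json.loads(match.group(1))


row = extract_row(SAMPLE_TEXT, 'Beleidsdekkingsgraad')
print(row[-1])
```

Because the whole list is parsed, the other entries stay available too, for example row[1:] for all values without the label.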

This is an easy-to-follow solution which works very well! With the provided regex explanation I'm able to track what's going on under the hood. Your and third solutions are great. To be specific, I think the third one is idiomatic and easy to follow for beginners such as me. — Wokkel, Sep 05 '19 at 09:19
