Parsing JSON out of HTML with BeautifulSoup

Question

import json
import re

from bs4 import BeautifulSoup

data = """
<script data-hid="ld-json-ld.1551860" data-n-head="ssr" preserve="preserve" type="application/ld+json">{"@context":"http://schema.org","@type":"NewsArticle","mainEntityOfPage":{"@type":"WebPage","@id":"https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860"},"headline":"Plötzlich ist  das Klimaziel in Griffweite | NZZ","datePublished":"2020-04-15T12:33:47.623Z","dateModified":"2020-04-15T12:35:01.841Z","publisher":{"@type":"Organization","name":"Neue Zürcher Zeitung AG, Schweiz","url":"https://www.nzz.ch","logo":{"@type":"ImageObject","url":"https://www.nzz.ch/logo.png","width":413,"height":60},"contactPoint":[{"@type":"ContactPoint","telephone":"+41-44-2581000","contactType":"customer service"}],"sameAs":["https://www.facebook.com/nzz","https://www.twitter.com/nzz","https://www.youtube.com/channel/UCK1aTcR0AckQRLTlK0c4fuQ","https://www.linkedin.com/company/neue-zurcher-zeitung","https://plus.google.com/+nzz/","http://www.freebase.com/m/041b43"]},"description":"Der Ausstoss an Treibhausgasen geht nur langsam zurück. Wegen der Pandemie und der warmen Witterung könnte das Klimaziel 2020 trotzdem erfüllt werden. Der Bund aber bleibt skeptisch.","isAccessibleForFree":false,"hasPart":{"@type":"WebPageElement","isAccessibleForFree":false,"cssSelector":".regwalled"},"image":{"@type":"ImageObject","url":"https://img.nzz.ch/O=75/https://nzz-img.s3.amazonaws.com/2020/4/15/b71dc7b9-0813-4082-9bb0-a2fd28395a67.jpeg","width":"7050","height":"4705"},"author":{"@type":"Person","name":"David Vonplon"}}</script>"""

soup = BeautifulSoup(data, "html.parser")

pattern = re.compile(r"window.Rent.data\s+=\s+(\{.*?\});\n")
script = soup.find("script", text=pattern)

print(script)

I want to parse the JSON out of this script tag, but I'm only getting None.

I also tried this:

soup.find("script").text

Output: ''

Can somebody help me find my mistake?

This is a simplified version of a larger script that was still working four or five months ago. Now it no longer works, and I have no idea what I'm doing wrong.

Thank you very much. Marco

https://stackoverflow.com/questions/43852187/beautifulsoup-extract-json-from-js Look at this. — Gunesh Shanbhag, Aug 02 '20 at 13:32
Does this answer your question? [BeautifulSoup - extract json from JS](https://stackoverflow.com/questions/43852187/beautifulsoup-extract-json-from-js) — Gunesh Shanbhag, Aug 02 '20 at 13:33
Thank you very much, but no. I tried this already and as you see, I took the code from this link and not my original one. — Marco_CH, Aug 02 '20 at 13:41

bigbounty · Answer 1 · 2020-08-02T14:04:22.807

Pass one of the script tag's attributes to find() as a filter.

import json

from bs4 import BeautifulSoup

data = """
<script data-hid="ld-json-ld.1551860" data-n-head="ssr" preserve="preserve" type="application/ld+json">{"@context":"http://schema.org","@type":"NewsArticle","mainEntityOfPage":{"@type":"WebPage","@id":"https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860"},"headline":"Plötzlich ist  das Klimaziel in Griffweite | NZZ","datePublished":"2020-04-15T12:33:47.623Z","dateModified":"2020-04-15T12:35:01.841Z","publisher":{"@type":"Organization","name":"Neue Zürcher Zeitung AG, Schweiz","url":"https://www.nzz.ch","logo":{"@type":"ImageObject","url":"https://www.nzz.ch/logo.png","width":413,"height":60},"contactPoint":[{"@type":"ContactPoint","telephone":"+41-44-2581000","contactType":"customer service"}],"sameAs":["https://www.facebook.com/nzz","https://www.twitter.com/nzz","https://www.youtube.com/channel/UCK1aTcR0AckQRLTlK0c4fuQ","https://www.linkedin.com/company/neue-zurcher-zeitung","https://plus.google.com/+nzz/","http://www.freebase.com/m/041b43"]},"description":"Der Ausstoss an Treibhausgasen geht nur langsam zurück. Wegen der Pandemie und der warmen Witterung könnte das Klimaziel 2020 trotzdem erfüllt werden. Der Bund aber bleibt skeptisch.","isAccessibleForFree":false,"hasPart":{"@type":"WebPageElement","isAccessibleForFree":false,"cssSelector":".regwalled"},"image":{"@type":"ImageObject","url":"https://img.nzz.ch/O=75/https://nzz-img.s3.amazonaws.com/2020/4/15/b71dc7b9-0813-4082-9bb0-a2fd28395a67.jpeg","width":"7050","height":"4705"},"author":{"@type":"Person","name":"David Vonplon"}}</script>"""

soup = BeautifulSoup(data, "html.parser")

print(json.loads(soup.find("script", {"preserve":"preserve"}).get_text(strip=True)))

Output:

{'@context': 'http://schema.org', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860'}, 'headline': 'Plötzlich ist  das Klimaziel in Griffweite | NZZ', 'datePublished': '2020-04-15T12:33:47.623Z', 'dateModified': '2020-04-15T12:35:01.841Z', 'publisher': {'@type': 'Organization', 'name': 'Neue Zürcher Zeitung AG, Schweiz', 'url': 'https://www.nzz.ch', 'logo': {'@type': 'ImageObject', 'url': 'https://www.nzz.ch/logo.png', 'width': 413, 'height': 60}, 'contactPoint': [{'@type': 'ContactPoint', 'telephone': '+41-44-2581000', 'contactType': 'customer service'}], 'sameAs': ['https://www.facebook.com/nzz', 'https://www.twitter.com/nzz', 'https://www.youtube.com/channel/UCK1aTcR0AckQRLTlK0c4fuQ', 'https://www.linkedin.com/company/neue-zurcher-zeitung', 'https://plus.google.com/+nzz/', 'http://www.freebase.com/m/041b43']}, 'description': 'Der Ausstoss an Treibhausgasen geht nur langsam zurück. Wegen der Pandemie und der warmen Witterung könnte das Klimaziel 2020 trotzdem erfüllt werden. Der Bund aber bleibt skeptisch.', 'isAccessibleForFree': False, 'hasPart': {'@type': 'WebPageElement', 'isAccessibleForFree': False, 'cssSelector': '.regwalled'}, 'image': {'@type': 'ImageObject', 'url': 'https://img.nzz.ch/O=75/https://nzz-img.s3.amazonaws.com/2020/4/15/b71dc7b9-0813-4082-9bb0-a2fd28395a67.jpeg', 'width': '7050', 'height': '4705'}, 'author': {'@type': 'Person', 'name': 'David Vonplon'}}

Update:

Apparently the page has several script tags with the preserve attribute, so filter on an additional attribute such as data-hid as well.

import json
import re

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860?reduced=true")
soup = BeautifulSoup(res.text, "html.parser")

# Several script tags carry preserve="preserve", so also match the data-hid prefix.
script = soup.find("script", attrs={"preserve": "preserve", "data-hid": re.compile(r"^ld-json-ld")})
data = json.loads(script.get_text(strip=True))

print(data)

Output:

{'@context': 'http://schema.org', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860'}, 'headline': 'Plötzlich ist  das Klimaziel in Griffweite | NZZ', 'datePublished': '2020-04-15T12:33:47.623Z', 'dateModified': '2020-04-15T13:49:04.823Z', 'publisher': {'@type': 'Organization', 'name': 'Neue Zürcher Zeitung AG, Schweiz', 'url': 'https://www.nzz.ch', 'logo': {'@type': 'ImageObject', 'url': 'https://www.nzz.ch/logo.png', 'width': 413, 'height': 60}, 'contactPoint': [{'@type': 'ContactPoint', 'telephone': '+41-44-2581000', 'contactType': 'customer service'}], 'sameAs': ['https://www.facebook.com/nzz', 'https://www.twitter.com/nzz', 'https://www.youtube.com/channel/UCK1aTcR0AckQRLTlK0c4fuQ', 'https://www.linkedin.com/company/neue-zurcher-zeitung', 'https://plus.google.com/+nzz/', 'http://www.freebase.com/m/041b43']}, 'description': 'Der Ausstoss an Treibhausgasen geht nur langsam zurück. Wegen der Pandemie und der warmen Witterung könnte das Klimaziel 2020 trotzdem erfüllt werden. Der Bund aber bleibt skeptisch.', 'isAccessibleForFree': False, 'hasPart': {'@type': 'WebPageElement', 'isAccessibleForFree': False, 'cssSelector': '.regwalled'}, 'image': {'@type': 'ImageObject', 'url': 'https://img.nzz.ch/O=75/https://nzz-img.s3.amazonaws.com/2020/4/15/b71dc7b9-0813-4082-9bb0-a2fd28395a67.jpeg', 'width': '7050', 'height': '4705'}, 'author': {'@type': 'Person', 'name': 'David Vonplon'}}

Thank you bigbounty. With this code I'm getting: JSONDecodeError: Expecting value: line 1 column 1 (char 0). Do I maybe have a problem with a package? — Marco_CH, Aug 02 '20 at 13:48
Is this the page you want to scrape: https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860? — bigbounty, Aug 02 '20 at 13:49
Yes. I have a mining script that runs every 5 minutes and saves all new articles as HTML in a CSV. A second script then parses all the information out of the HTML. In April it worked; now it gives me this JSONDecodeError. — Marco_CH, Aug 02 '20 at 13:53
I'm getting this error also if I'm using your code as you've posted it in this thread. — Marco_CH, Aug 02 '20 at 13:53
@Marco_CH Updated my answer. Now scraping through the page and getting the information — bigbounty, Aug 02 '20 at 14:04
Thanks. Unfortunately the same error. "Expecting value: line 1 column 1 (char 0)". — Marco_CH, Aug 02 '20 at 14:07
Probably you are scraping some other link so your html is different — bigbounty, Aug 02 '20 at 14:08
But then your first code, where we just took the HTML, should have worked. The HTML I posted was taken from this URL; I just cannot extract the JSON part. And I'm doing exactly the same as you, so I'm afraid the problem is in a package or in my Anaconda environment. — Marco_CH, Aug 02 '20 at 14:12
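
For what it's worth, the error quoted in these comments is the message json.loads gives for an empty string, which fits the '' that soup.find("script").text returned in the question. So the JSON itself is probably fine and the text extraction is coming back empty. A minimal sketch to check that, using only the standard library (the helper name describe_json_error is made up for illustration):

```python
import json


def describe_json_error(text):
    """Return the JSONDecodeError message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# An empty string, like the '' that soup.find("script").text gave in the question.
print(describe_json_error(""))
# A non-empty JSON document parses without error, so None is printed.
print(describe_json_error('{"@type": "NewsArticle"}'))
```

If the first call prints the same message as in the traceback, it is worth printing the extracted string before handing it to json.loads.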

score 1 · Accepted Answer · answered Aug 02 '20 at 15:16

Try

json.loads(soup.select_one('script').string)

and see if that works. It works for me with the data string from your question.
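
A slightly fuller sketch of the same idea, in case it helps: it selects the script by its type attribute and reads .string instead of get_text(). The helper name extract_json_ld and the shortened markup are made up for illustration. One possible reason .string works where .text returned '' is that, as far as I know, Beautiful Soup 4.9.0 began storing script contents as a separate Script string type that get_text() skipped by default. Treat that as an assumption and check it against your installed version.

```python
import json

from bs4 import BeautifulSoup

# Shortened stand-in for the NZZ markup in the question (an assumption, not the full page).
html = (
    '<script type="text/javascript">var x = 1;</script>'
    '<script data-hid="ld-json-ld.1551860" preserve="preserve" '
    'type="application/ld+json">'
    '{"@type": "NewsArticle", "author": {"@type": "Person", "name": "David Vonplon"}}'
    "</script>"
)


def extract_json_ld(markup):
    """Parse every application/ld+json script block found in markup."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for tag in soup.find_all("script", attrs={"type": "application/ld+json"}):
        # .string is the tag's single child string; it is None for an empty tag.
        if tag.string:
            results.append(json.loads(tag.string))
    return results


articles = extract_json_ld(html)
print(articles[0]["author"]["name"])  # David Vonplon
```

Filtering on type="application/ld+json" also sidesteps the problem of several script tags sharing the preserve attribute.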

Jack Fleeting
