I want to scrape and parse a London Stock Exchange news article.
Almost the entire content of the site comes from a JSON payload that's consumed by JavaScript. However, that payload can be easily extracted with BeautifulSoup and parsed with the json module.
But the encoding of the script is a bit funky.
The <script> tag has an id of "ng-lseg-state", which means the content is escaped with Angular's custom HTML encoding.
For example:
&l;div class=\"news-body-content\"&g;&l;html xmlns=\"http://www.w3.org/1999/xhtml\"&g;\n&l;head&g;\n&l;meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /&g;\n&l;title&g;&l;/title&g;\n&l;meta name=\"generator\"
I handle this with a .replace() chain:
import json

import requests
from bs4 import BeautifulSoup

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"

# The encoded payload sits in a <script id="ng-lseg-state"> tag.
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})

# Quotes are escaped as &q;, so restore them before parsing the JSON.
article = json.loads(script.string.replace("&q;", '"'))

main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]

# Undo the rest of the escaping I know about so far.
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)

print(BeautifulSoup(decoded_body, "lxml").find_all("p"))
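As an aside, the long main_key is just one of the top-level keys of the decoded JSON. If hard-coding it ever becomes brittle, I assume it could be looked up at runtime by the newsId it contains:

# Hypothetical lookup instead of the hard-coded key:
main_key = next(k for k in article if "newsId%3D14850033" in k)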
But there are still some characters that I'm not sure how to handle: &a;#160;, &a;amp;, and &s;, just to name a few.
So, the question is: how do I deal with the rest of these characters? Or maybe there's a parser or a reliable character mapping out there that I don't know of?
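For what it's worth, my working guess, judging from &a;amp; and &a;#160;, is that &a; stands for a bare ampersand and &s; for a single quote, so after replacing &a; the leftovers become ordinary HTML entities that html.unescape() can finish off. Here's a sketch of that idea (the &a; and &s; mappings are my assumption, not something confirmed):

import html
import re

# Assumed escape map, extrapolated from the entities seen so far;
# &l;, &g;, &q; are confirmed above, &a; and &s; are my guesses.
ESCAPES = {
    "&a;": "&",
    "&q;": '"',
    "&s;": "'",
    "&l;": "<",
    "&g;": ">",
}

def decode(text):
    # One regex pass, so a replacement can't accidentally produce
    # a new escape sequence for a later substitution to mangle.
    text = re.sub(
        "|".join(map(re.escape, ESCAPES)),
        lambda m: ESCAPES[m.group(0)],
        text,
    )
    # &a;#160; -> &#160;, &a;amp; -> &amp; -- plain HTML entities now.
    return html.unescape(text)

That covers everything I've run into so far, but I'd still prefer an authoritative mapping or an existing parser over guesswork.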