I want to scrape and parse a London Stock Exchange news article.

Almost the entire content of the site comes from a JSON payload that's consumed by JavaScript. However, it can be easily extracted with BeautifulSoup and parsed with the json module.

But the encoding of the script is a bit funky.

The <script> tag has an id of "ng-lseg-state", which indicates the payload is Angular transfer state, serialized with Angular's custom HTML escaping.

For example:

&l;div class=\"news-body-content\"&g;&l;html xmlns=\"http://www.w3.org/1999/xhtml\"&g;\n&l;head&g;\n&l;meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /&g;\n&l;title&g;&l;/title&g;\n&l;meta name=\"generator\"

I handle this with a .replace() chain:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
article = json.loads(script.string.replace("&q;", '"'))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)
print(BeautifulSoup(decoded_body, "lxml").find_all("p"))

But there are still some characters that I'm not sure how to handle:

  • &a;#160;
  • &a;amp;
  • &s;

just to name a few.

So, the question is: how do I deal with the rest of these characters? Or is there a parser or a reliable character mapping out there that I don't know of?

baduker
  • Related: https://stackoverflow.com/questions/62127215/beautiful-soup-how-to-decode-html-json-data-in-script-object Same problem, same website. – QHarr Apr 10 '21 at 21:57
  • 2
    Thanks @QHarr for pointing that one out. Seems like we could all benefit from a more generic solution than a chain of `.replace()` methods. – baduker Apr 10 '21 at 22:06

1 Answer

Angular encodes transfer state with a dedicated escape function in its source:

export function escapeHtml(text: string): string {
  const escapedText: {[k: string]: string} = {
    '&': '&a;',
    '"': '&q;',
    '\'': '&s;',
    '<': '&l;',
    '>': '&g;',
  };
  return text.replace(/[&"'<>]/g, s => escapedText[s]);
}

export function unescapeHtml(text: string): string {
  const unescapedText: {[k: string]: string} = {
    '&a;': '&',
    '&q;': '"',
    '&s;': '\'',
    '&l;': '<',
    '&g;': '>',
  };
  return text.replace(/&[^;]+;/g, s => unescapedText[s]);
}

You can reproduce the unescapeHtml function in Python, and add html.unescape to resolve the remaining standard HTML entities:

import json
import requests
from bs4 import BeautifulSoup
import html

unescapedText = {
    '&a;': '&',
    '&q;': '"',
    '&s;': '\'',
    '&l;': '<',
    '&g;': '>',
}

def unescape(text):
    # Undo Angular's transfer-state escaping, then resolve standard HTML entities.
    for key, value in unescapedText.items():
        text = text.replace(key, value)
    return html.unescape(text)

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {
    "id": "ng-lseg-state"
})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p"))

You were missing `&s;` and `&a;`.
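As a side note, chained `.replace()` calls can double-decode: a literal `&q;` in the original text is escaped by Angular to `&a;q;`, and replacing `&a;` first turns it into `&q;`, which the next replacement then wrongly converts to a quote. A single regex pass, like Angular's own unescapeHtml, avoids this. A minimal sketch (the function name is mine):

```python
import html
import re

# Mapping from Angular's transfer-state escape codes back to characters.
ANGULAR_UNESCAPE = {
    '&a;': '&',
    '&q;': '"',
    '&s;': "'",
    '&l;': '<',
    '&g;': '>',
}

def unescape_transfer_state(text):
    # Single left-to-right pass: each escape code is replaced exactly once,
    # and the replacement text is never rescanned, so "&a;q;" correctly
    # decodes to the literal string "&q;" rather than a double quote.
    decoded = re.sub(r'&[aqslg];', lambda m: ANGULAR_UNESCAPE[m.group()], text)
    # Resolve remaining standard HTML entities such as &#160; or &amp;.
    return html.unescape(decoded)

print(repr(unescape_transfer_state('&a;#160;')))  # '\xa0' (non-breaking space)
print(unescape_transfer_state('&a;amp;'))         # &
print(unescape_transfer_state('&s;'))             # '
```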

repl.it: https://replit.com/@bertrandmartel/AngularTransferStateDecode

Bertrand Martel