Convert HTML entities in plain text to characters

Question

I scraped news article titles and URLs and stored them in a TSV file as plain text. For some reason, the scraper I use converts some characters (€, for example) into hex codes. I have tried to change this on the scraper side, but with no luck. What I want is to change the hex codes back into the actual characters, so that I can load the actual strings into a Postgres database.

An example is the following string: Motorists could be charged for every mile they drive to raise &#x20AC;35bn, which should be stored in the db as Motorists could be charged for every mile they drive to raise €35bn

What I have tried so far is to find all hex codes in the file, strip off the &#x parts, and convert each hex code into the actual character. In the € case, I tried:

s_decoded = bytes.fromhex("20AC").decode('ascii')

and

s_decoded = bytes.fromhex("20AC").decode('utf-8')

which respectively give the errors: UnicodeDecodeError: 'ascii' codec can't decode byte 0xac in position 1: ordinal not in range(128) and UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 1: invalid start byte.

I have been going over loads of previous questions on here, but just can't seem to figure out why this is happening in my case. Sorry if this is a duplicate, but if someone could then point me to what would solve my problem, that would be much appreciated.

Also, the question should be about converting HTML entities to a string, not hex codes to a string. Please correct that too. — Abhi, Feb 24 '22 at 09:01
Thank you for that. Just for my own sanity, is ```20AC``` not hexacode? I'm fairly new to this encoding, so don't have a good understanding yet. — whaddaplaya, Feb 24 '22 at 09:07
Also, now that I know these are called HTML entities, I found another SO question where this is answered, making this a duplicate. Any way I can update that? https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string — whaddaplaya, Feb 24 '22 at 09:09

score 1 · Accepted Answer · answered Feb 24 '22 at 08:41

To decode HTML entities like the one in your example, you could use the following code.

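import html

html_encoded = 'Motorists could be charged for every mile they drive to raise &#x20AC;35bn'
# html.unescape converts numeric and named character references to characters
html_decoded = html.unescape(html_encoded)
print(html_decoded)  # Motorists could be charged for every mile they drive to raise €35bn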
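
As a side note on why the attempts in the question fail: 20AC is the Unicode code point of the euro sign (U+20AC), not its encoded bytes. bytes.fromhex("20AC") therefore produces the two bytes 0x20 0xAC, which are neither valid ASCII nor valid UTF-8. The following is only a minimal sketch, assuming the hex digits have already been stripped out of the entity; the variable names are illustrative and not from the question.

```python
import html

# Hex digits taken from the entity &#x20AC; (illustrative variable name)
hex_digits = "20AC"

# Interpret the digits as a Unicode code point, not as raw bytes
euro = chr(int(hex_digits, 16))

# In UTF-8 the euro sign is three bytes, not the two bytes 0x20 0xAC
euro_utf8 = euro.encode("utf-8")

# html.unescape handles hexadecimal, decimal, and named entities alike
decoded = html.unescape("&#x20AC;35bn &#8364;35bn &euro;35bn")

print(euro, euro_utf8, decoded)
```

In practice html.unescape is likely the simpler choice, since it avoids finding and stripping the entities by hand.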

Abhi
