I scraped news article titles and URLs and stored them in a TSV file as plain text. For some reason, the scraper I use converts some characters (€, for example) into hexacode. I have tried to change this on the scraper side, but had no luck. What I want is to change the hexacode into the actual character, so that I can load the actual strings into a Postgres database.

An example is the following string: Motorists could be charged for every mile they drive to raise &#x20AC;35bn, which should be stored in the db as Motorists could be charged for every mile they drive to raise €35bn

What I have tried so far is to find all hexacodes in the file, strip off the &#x parts, and convert the hexacode into the actual character. In the € case, I tried:

s_decoded = bytes.fromhex("20AC").decode('ascii')

and

s_decoded = bytes.fromhex("20AC").decode('utf-8')

which respectively give the errors: UnicodeDecodeError: 'ascii' codec can't decode byte 0xac in position 1: ordinal not in range(128) and UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 1: invalid start byte.

I have been going over loads of previous questions on here, but just can't seem to figure out why this is happening in my case. Sorry if this is a duplicate, but if someone could then point me to what would solve my problem, that would be much appreciated.

  • Also, the question should be about converting HTML entities to a string, not hexacode to a string. Please correct that too. – Abhi Feb 24 '22 at 09:01
  • Thank you for that. Just for my own sanity, is ```20AC``` not hexacode? I'm fairly new to this encoding, so don't have a good understanding yet. – whaddaplaya Feb 24 '22 at 09:07
  • Also, now that I know these are called HTML entities, I found another SO question where this is answered, making this a duplicate. Any way I can update that? https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string – whaddaplaya Feb 24 '22 at 09:09
  • I have done it, hope it will reflect soon :) – Abhi Feb 24 '22 at 09:12

1 Answer

To decode HTML entities like the one in your example, you could use the following code.

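import html

# The scraped title still contains the HTML entity for the euro sign.
html_encoded = 'Motorists could be charged for every mile they drive to raise &#x20AC;35bn'
html_decoded = html.unescape(html_encoded)
print(html_decoded)  # Motorists could be charged for every mile they drive to raise €35bn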
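
As for why the attempts in the question fail: 20AC is the Unicode code point of the euro sign (U+20AC), not its encoded bytes, so bytes.fromhex("20AC") produces two bytes that are valid in neither ASCII nor UTF-8. A rough sketch of the difference follows, along with one way to apply html.unescape to a line of the TSV file. The helper name unescape_fields and the example.com URL are made up for illustration, and the sketch assumes the file is tab-separated with one record per line.

```python
import html

# "20AC" is a code point, so it goes through int() and chr(),
# not through bytes.fromhex().
euro = chr(int("20AC", 16))

# In UTF-8 that code point is encoded as three bytes: E2 82 AC.
euro_bytes = euro.encode("utf-8")


def unescape_fields(line):
    """Decode HTML entities in every tab-separated field of one TSV line.

    Hypothetical helper, not part of the html module.
    """
    return "\t".join(html.unescape(field) for field in line.split("\t"))


# Illustrative record: title, then a placeholder URL.
row = (
    "Motorists could be charged for every mile they drive "
    "to raise &#x20AC;35bn\thttps://example.com/article"
)
decoded_row = unescape_fields(row)
print(decoded_row)
```

Note that html.unescape also decodes entities such as &amp;amp; inside the URL column, which is usually what you want for URLs taken from HTML. If a title could contain an entity that decodes to a tab or newline, it would need extra handling before being written back to a TSV file.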

– Abhi