I scraped news article titles and URLs and stored them in a TSV file as plain text. For some reason, the scraper I use converts some characters (€, for example) into hex codes of the form &#x20AC;. I have tried to change this on the scraper side, but no luck. What I want is to turn each hex code back into the actual character, so that I can load the proper strings into a Postgres database.
An example is the following string: Motorists could be charged for every mile they drive to raise &#x20AC;35bn
which should be stored in the db as: Motorists could be charged for every mile they drive to raise €35bn
What I have tried so far is to find all hex codes in the file, strip off the &#x part, and convert the remaining hex digits into the actual character. For the € case I tried:
s_decoded = bytes.fromhex("20AC").decode('ascii')
and
s_decoded = bytes.fromhex("20AC").decode('utf-8')
These raise, respectively:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xac in position 1: ordinal not in range(128)
and
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 1: invalid start byte
I have been going over loads of previous questions on here, but just can't seem to figure out why this is happening in my case. Sorry if this is a duplicate, but if someone could then point me to what would solve my problem, that would be much appreciated.
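For reference, here is a minimal sketch of what seems to be going on, assuming the codes in the file are HTML numeric character references. The value 20AC is a Unicode code point, not an encoded byte sequence, so `bytes.fromhex("20AC")` produces the two bytes 0x20 and 0xAC, which are not valid ASCII or UTF-8. Treating the hex digits as a code point with `chr`, or letting the standard library's `html.unescape` handle the whole string, avoids the error. The variable names below are only illustrative.

```python
import html

# Example title as it appears in the TSV file (from the question).
raw = "Motorists could be charged for every mile they drive to raise &#x20AC;35bn"

# Interpret the hex digits as a Unicode code point rather than as raw bytes.
euro = chr(int("20AC", 16))

# html.unescape converts numeric (&#x20AC;) and named (&euro;) references alike,
# so the manual find-and-strip step is not needed.
decoded = html.unescape(raw)
print(decoded)

# For comparison: the UTF-8 encoding of the euro sign is the three bytes E2 82 AC,
# which is why decoding the two bytes 20 AC fails.
utf8_euro = bytes.fromhex("E282AC").decode("utf-8")
```

Applying `html.unescape` to each title before inserting it into Postgres should be enough, provided the database connection uses UTF-8.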