I am reading an HTML document using Python. It has many characters like `\x93`, `\x94`, and `\xa0`. I presume they correspond to the Latin-1 Supplement encoding. Is there a library that deals with this?
Viewed 1,521 times
-1

OlorinIstari
-
Can you also post the code you are using and the error you are getting? – ksohan May 22 '20 at 13:34
-
maybe you need only `decode('latin1')` or even `open(... ,encoding='latin1')` – furas May 22 '20 at 13:36
-
I am not getting any error. When I download the file, read it in Python using `utf-8` encoding, and print it, I can see occurrences of `\x93` etc. I have also tried reading it with other encoding schemes. – OlorinIstari May 22 '20 at 13:36
-
First, show the URL of the file you downloaded and the code you use to download it. Usually HTML pages carry information about their encoding, so you don't have to decode them manually. Next, check on Google which encoding uses the codes `\x93`, `\x94`, `\xa0`, and you will know whether it is really `latin1` or something else. – furas May 22 '20 at 13:40
-
Using Google I found [Python: Removing \xa0 from string?](https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string). You should learn to use Google before you ask. – furas May 22 '20 at 13:43
1 Answer
0
You can simply encode and decode strings as Latin-1 in Python, for example `data.decode('latin1')`. Note that in Python 3 `decode` is a method of `bytes` objects, not of `str`.
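
A minimal sketch of what that decoding does, under stated assumptions: `raw` is a made-up byte string standing in for the downloaded file, and Windows-1252 (`cp1252`) is only a guess at the page's real encoding, not something the question confirms. It is included because bytes `0x93` and `0x94` decode to invisible control characters under Latin-1 but to curly quotes under Windows-1252, which may explain why `\x93` still shows up after decoding.

```python
# Hypothetical sample bytes standing in for the downloaded HTML file.
raw = b"\x93quoted\x94\xa0text"

# Latin-1 maps each byte to the code point with the same number.
# It never raises, but 0x93 and 0x94 become C1 control characters.
as_latin1 = raw.decode("latin1")

# Windows-1252 (an assumption about the real encoding) maps
# 0x93 and 0x94 to left and right curly double quotes.
as_cp1252 = raw.decode("cp1252")

# repr() makes non-printable characters visible as escapes.
print(repr(as_latin1))
print(repr(as_cp1252))
```

If the second result reads naturally and the first is full of `\x93`-style escapes, the file is more likely Windows-1252 than Latin-1.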

Kyrylo Kundik
-
Hi. Thanks! That seems to work. But I still have occurrences of `\xa0`. By any chance, do you know which encoding that belongs to? – OlorinIstari May 22 '20 at 13:38
-
@ShrutheeshRamanIyer to answer which encoding `\xa0` belongs to, we would have to use Google, but you could use Google on your own. – furas May 22 '20 at 13:41
-
`\xa0` is actually the non-breaking space in Latin-1 (ISO 8859-1), the same character as `chr(160)`. You should replace it with a regular space. From https://stackoverflow.com/a/11566398/12181022 – Kyrylo Kundik May 22 '20 at 13:42
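
The replacement suggested in the last comment could look roughly like this. The sample string `text` is hypothetical and assumes the file has already been decoded into a `str`.

```python
# Hypothetical already-decoded text containing non-breaking spaces.
text = "price:\xa0100\xa0USD"

# "\xa0" and chr(160) are the same character (U+00A0, NO-BREAK SPACE).
nbsp = chr(160)

# Swap every non-breaking space for an ordinary ASCII space.
cleaned = text.replace(nbsp, " ")

print(repr(cleaned))
```

`str.replace` returns a new string, so the result has to be assigned; the original `text` is left unchanged.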