I have some scraped data that I have written to an HTML file held locally as a 'raw' version before I do some data manipulation.
The issue is that when I process the website I have trouble dealing with a "’" (apostrophe) character.
After much research I am at the end of my tether. I have read a lot about that apostrophe causing issues, and I have tried many combinations of encoding and decoding, chardet, etc., and still cannot get it to work.
A word that appears in a few tables is: CA’BELLAVISTA
When I run the script, the IDE console prints the word correctly once I get the right encoding/decoding pattern. However, when I view the output HTML file I get CA\x92BELLAVISTA every time.
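To illustrate the symptom, here is a minimal sketch, assuming the page bytes are actually Windows-1252 (cp1252) rather than ISO-8859-1. That is only an assumption on my part; the byte string below is a stand-in for what the page returns, not real scraped data.

```python
# Assumption: the scraped bytes are Windows-1252 (cp1252), where byte 0x92
# is the right single quotation mark (U+2019).
raw = b"CA\x92BELLAVISTA"  # stand-in for the bytes returned by read()

# Decoded as Latin-1, byte 0x92 maps to the invisible control character U+0092.
as_latin1 = raw.decode("latin-1")

# Decoded as cp1252, the same byte maps to the curly apostrophe U+2019.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))           # 'CA\x92BELLAVISTA'
print(hex(ord(as_cp1252[2])))    # 0x2019

# Re-encoding the correctly decoded text gives the UTF-8 bytes for the file.
utf8_bytes = as_cp1252.encode("utf-8")
```

If that assumption holds, the \x92 in the file would come from decoding as Latin-1, not from the browser.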
The script is simply a urllib.response.read() followed by encoding.
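For context, the fetch-and-decode step could be sketched roughly as below. The helper names `decode_page` and `fetch_text` are hypothetical, and treating a declared ISO-8859-1 as cp1252 is an assumption (it mirrors what web browsers do with that label), not something I have confirmed for this site.

```python
from urllib.request import urlopen


def decode_page(raw, declared=None):
    """Decode page bytes to str (hypothetical helper).

    Assumption: pages labelled ISO-8859-1 are really Windows-1252,
    so that label is mapped to cp1252 before decoding.
    """
    charset = (declared or "cp1252").lower()
    if charset in ("iso-8859-1", "latin-1", "latin1"):
        charset = "cp1252"
    # errors="replace" avoids a crash on bytes that the charset leaves undefined.
    return raw.decode(charset, errors="replace")


def fetch_text(url):
    """Fetch a URL and return its decoded text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()
        # Charset from the Content-Type header, or None if it is absent.
        declared = response.headers.get_content_charset()
    return decode_page(raw, declared)
```

Calling `decode_page(b"CA\x92BELLAVISTA", "ISO-8859-1")` would then give the word with the curly apostrophe rather than the control character.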
Is it the web browser doing it, or is the script genuinely not getting the correct character?
The next step involves loading the HTML file back in for further manipulation and output to JSON/CSV, so I thought nailing the encoding when writing the HTML file would be the best option.
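A minimal sketch of the write-then-reload step, assuming the text has already been decoded correctly and that the file should be UTF-8. The file name `raw.html` and the temporary directory are placeholders for illustration only.

```python
import os
import tempfile

# Assumption: this is the already-decoded text (a str, not bytes).
text = "CA\u2019BELLAVISTA"

# Placeholder location; a real script would use its own output path.
path = os.path.join(tempfile.mkdtemp(), "raw.html")

# Pass encoding explicitly so the result does not depend on the OS default.
with open(path, "w", encoding="utf-8") as f:
    # Declaring the charset tells the browser how to read the file.
    f.write('<meta charset="utf-8">\n')
    f.write(text)

# Read it back with the same encoding for the later JSON/CSV step.
with open(path, encoding="utf-8") as f:
    loaded = f.read()
```

Without the explicit `encoding` argument, `open()` falls back to a platform-dependent default, which could explain different results between the IDE console and the file.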
I think it's an ISO-8859-1/Latin-1 charset, although that seems to change on the odd webpage. I hope I'm doing the correct thing in trying to convert it to UTF-8.