I have some scraped data that I have written to a local HTML file as a 'raw' version before I do any data manipulation.

The issue is that when I process the website, I have trouble dealing with a "'" character.

After much research I am at the end of my tether. I have read a lot about that apostrophe causing issues and have tried many combinations of encoding and decoding, chardet, etc., and still cannot get it to work.

A word in a few tables is : CA’BELLAVISTA

When I run the script, the IDE console prints the word correctly once I have the right encoding/decoding pattern. However, when I view the HTML file it writes, I get CA\x92BELLAVISTA every time.

The script is simply a urllib.response.read() then encoding.

Is it the web browser doing it, or is the script genuinely not getting the correct character?

The next step involves loading the HTML file for further manipulation and output to JSON/CSV, so I thought nailing the encoding when writing the HTML file would be the best option.

I think it's an ISO-8859-1/Latin-1 charset, although that seems to change on the odd webpage. I hope I'm doing the correct thing in trying to put it into UTF-8.

Chris
  • I for some reason am not able to understand the issue – Tarun Lalwani Jan 25 '18 at 15:38
  • That `’` char is a RIGHT SINGLE QUOTATION MARK, U+2019. In UTF-8 it's `b'\xe2\x80\x99'`. It doesn't exist in the Latin-1 encoding, but it does exist in the (notorious) Windows cp1252, where it has the encoding `b'\x92'`. – PM 2Ring Jan 25 '18 at 15:39
  • So when you get the `bytes` data from `response.read()` you need to decode it using the `'cp1252'` codec. And when you're saving it to HTML, encode to UTF-8 (and make sure that the HTML has a matching `<meta charset="utf-8">` tag). – PM 2Ring Jan 25 '18 at 15:48
  • @PM 2Ring thanks so much. I had actually used `'cp1252'` towards the start of my problems but never went back to it once I had things working better. Done the trick now! – Chris Jan 25 '18 at 16:08
  • possible duplicate of https://stackoverflow.com/questions/15564063/apostrophe-turning-into-x92 – Mike Z Jan 25 '18 at 18:26
  • Possible duplicate of [apostrophe turning into \x92](https://stackoverflow.com/questions/15564063/apostrophe-turning-into-x92) – Alexei Levenkov Jan 28 '18 at 06:13
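
The fix suggested in the comments (decode the downloaded bytes as cp1252, then write the file as UTF-8) can be sketched roughly as below. This is only a minimal sketch: the helper names `decode_page` and `save_html`, the sample bytes, and the output file name are made up for illustration, and it assumes the page really is Windows-1252, which the question says may not hold for every page.

```python
import tempfile
from pathlib import Path


def decode_page(raw: bytes, encoding: str = "cp1252") -> str:
    """Decode bytes such as those returned by response.read().

    The default of cp1252 is an assumption about the source page.
    """
    return raw.decode(encoding)


def save_html(text: str, path: Path) -> None:
    """Write the text as UTF-8, with a <meta> tag declaring the same charset."""
    html = (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head>'
        "<body>" + text + "</body></html>\n"
    )
    path.write_text(html, encoding="utf-8")


# Hypothetical sample standing in for the downloaded bytes.
raw = b"CA\x92BELLAVISTA"

# Decoding as Latin-1 keeps byte 0x92 as the control character U+0092,
# which is why the output can show up as CA\x92BELLAVISTA.
as_latin1 = raw.decode("latin-1")

# Decoding as cp1252 maps byte 0x92 to U+2019, the right single quotation mark.
text = decode_page(raw)

# Hypothetical output location in the system temp directory.
out = Path(tempfile.gettempdir()) / "raw_page_example.html"
save_html(text, out)
```

Reading the file back later with `encoding="utf-8"` should then give the same text. Note that Python's `cp1252` codec raises `UnicodeDecodeError` on the few byte values that encoding leaves undefined, so pages in other encodings may need a different codec or an explicit `errors=` argument.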

0 Answers