
I can't properly encode and decode a string that contains single and double quotation marks. Note: I need to show quotation marks.

I saved the following string in a text file.

Here’s their mantra: “Eat less and exercise more. The secret to weight loss is energy balance. There are no good or bad calories. It’s all about moderation.” 

with open("file.txt", "r") as myfile:
    data = myfile.read()  # the with block closes the file; no explicit close() is needed

print data
The result:

HereΓÇÖs their mantra: ΓÇ£Eat less and exercise more. The secret to weight loss is energy balance. There are no good or bad calories. ItΓÇÖs all about moderation.ΓÇ¥ 

I can omit the quotation marks entirely, but I need to show them:

print data.decode('ascii', 'ignore') 

Heres their mantra: Eat less and exercise more. The secret to weight loss is energy balance. There are no good or bad calories. Its all about moderation.

print json.dumps(data)

"\ufeff\nHere\u2019s their mantra: \u201cEat less and exercise more. The secret to weight loss is energy balance. There are no good or bad calories. It\u2019s all about moderation.\u201d "
Anay Bose
  • Your console or terminal encoding doesn't support UTF-8 (the encoding of your input file). Your console uses cp437 instead. – Martijn Pieters Dec 14 '16 at 14:46
  • So, what should I do? – Anay Bose Dec 14 '16 at 14:48
  • What are you *trying* to do? Your console encoding doesn't support the 'fancy' quotes in the text; you could replace these with the ASCII equivalents, or you could alter your console encoding. – Martijn Pieters Dec 14 '16 at 14:49
  • The problem is not the presence of single and double quotation marks, it's the presence of *non-ASCII* quotation marks. – Rory Daulton Dec 14 '16 at 14:49
  • Or you could upgrade to Python 3.6 (nearly out), open the file as UTF-8 with `open('file.txt', 'r', encoding='utf8')` and printing then uses the Microsoft wide APIs to bypass the whole Windows codepage mess altogether. – Martijn Pieters Dec 14 '16 at 14:50
  • @MartijnPieters, I have toggled my console debugger encoding to UTF-8 and the problem vanishes. Nice catch. – Anay Bose Dec 14 '16 at 15:01

1 Answer


Your file isn't ASCII. You seem to realize that, because you explicitly told it to ignore decoding errors.

It looks like the file is UTF-8. In Python 2, data is a byte string, so print writes the raw UTF-8 bytes, which Windows then interprets through the console's default code page (on my system, cp437, an ASCII superset that provides a bunch of console drawing symbols as characters). To fix, decode it properly:

print data.decode('utf-8')
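To see where the garbage in the question comes from, here is a small sketch that reproduces it. It assumes the console really is cp437, as suggested in the comments; misread_as is a hypothetical helper name used only for illustration. The three UTF-8 bytes of ’ (E2 80 99) read as cp437 become ΓÇÖ, which is exactly what the question shows.

```python
from __future__ import print_function, unicode_literals
import json


def misread_as(text, wrong_encoding="cp437"):
    """Encode text as UTF-8, then decode those bytes with the wrong code page."""
    return text.encode("utf-8").decode(wrong_encoding)


smart_apostrophe = "\u2019"             # RIGHT SINGLE QUOTATION MARK
garbled = misread_as(smart_apostrophe)  # bytes E2 80 99 interpreted as cp437

# json.dumps escapes non-ASCII characters, so this prints safely on any console.
print(json.dumps(garbled))  # "\u0393\u00c7\u00d6"
```

The same helper applied to the opening and closing double quotes gives the other two sequences seen in the question's output.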

Alternatively, you can use the Python 3 open function, even in Python 2.7, by importing io and using io.open, which will let you specify the encoding and perform decoding for you automatically and seamlessly:

from __future__ import print_function  # Including the __future__ import makes
                                       # this 100% Py2/Py3 compatible
import io

with io.open("file.txt", encoding="utf-8") as myfile:
    data = myfile.read()
print(data)
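One detail worth noting: the json.dumps output in the question starts with \ufeff, which suggests the file begins with a UTF-8 byte order mark. A possible variation, sketched below, is to open the file with the "utf-8-sig" codec, which strips a leading BOM if one is present. The temporary file here is only a stand-in for file.txt so that the example runs on its own.

```python
from __future__ import print_function, unicode_literals
import io
import os
import tempfile

# Stand-in for file.txt: UTF-8 text preceded by a BOM (bytes EF BB BF).
sample = "Here\u2019s their mantra"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf" + sample.encode("utf-8"))

with io.open(path, encoding="utf-8") as f:
    with_bom = f.read()      # BOM survives as a leading U+FEFF character
with io.open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()   # BOM removed by the codec
os.remove(path)

print(len(with_bom) - len(without_bom))  # 1
```

Plain "utf-8" is still correct; "utf-8-sig" just saves stripping the U+FEFF by hand.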

If you're on Windows, your command prompt unfortunately probably won't support arbitrary Unicode output, and there is no 100% solution, but running

chcp 65001

in your cmd.exe prompt before launching the Python program will make it use UTF-8 as the console "code page" (how it interprets the raw bytes output by Python). There are bugs with that code page (search to learn more), but it's the closest you're going to get. You also need a Unicode-friendly console font; the comments below mention Consolas and Lucida Console.
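Before and after changing the code page, it can help to check whether the text is even representable in the encoding Python believes the console uses. The sketch below is one way to do that; console_can_show is a hypothetical helper, and falling back to "ascii" when sys.stdout.encoding is unset is an assumption made for the example.

```python
from __future__ import print_function, unicode_literals
import sys


def console_can_show(text, encoding=None):
    """Return True if text can be encoded in the given (or console) encoding."""
    # Assumption: fall back to ASCII when stdout reports no encoding (e.g. a pipe).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(console_can_show("\u2019", "cp437"))  # False: cp437 has no curly quotes
print(console_can_show("\u2019", "utf-8"))  # True
```

A False result under cp437 corresponds to the UnicodeEncodeError mentioned in the comments once the text has been decoded to Unicode.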

An alternative solution to manual code page manipulation (or moving to Python 3.6, which bypasses the code page issues entirely) is to use a package like unidecode to convert non-ASCII characters to their ASCII equivalents (so Unicode smart quotes become plain ASCII straight quotes). There is plenty of information on using unidecode available elsewhere, which I'll avoid regurgitating here.
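If only the smart quotes matter, a hand-rolled translation table is a lighter-weight option. To be clear, this is not unidecode: it is a standard-library sketch that covers just the four curly quote characters, and SMART_TO_ASCII and straighten_quotes are names made up for the example.

```python
from __future__ import print_function, unicode_literals

# Map the code points of the curly quotes to their straight ASCII equivalents.
SMART_TO_ASCII = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark / apostrophe
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}


def straighten_quotes(text):
    """Replace curly quotes in a Unicode string with straight ASCII quotes."""
    return text.translate(SMART_TO_ASCII)


print(straighten_quotes("It\u2019s all about \u201cmoderation.\u201d"))
# It's all about "moderation."
```

The result is pure ASCII for this input, so it prints on a cp437 console without any code page or font changes. Other non-ASCII characters pass through untouched, which is where a fuller package would be needed.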

ShadowRanger
  • the problem still persists. – Anay Bose Dec 14 '16 at 14:50
  • At which you'll get a UnicodeEncodeError because their console is configured for CP437. – Martijn Pieters Dec 14 '16 at 14:51
  • @AnayBose: no, you get a *different* error now. That's not the same problem. – Martijn Pieters Dec 14 '16 at 14:51
  • @MartijnPieters: I was editing in the `chcp` bit when you commented. :-) – ShadowRanger Dec 14 '16 at 14:52
  • @ShadowRanger: using `chcp` has issues too; the console font needs to be updated to support the Unicode characters too. – Martijn Pieters Dec 14 '16 at 14:58
  • @MartijnPieters: Hmm... True. I tested with a character I knew the Alt code for (é), which happens to be in the default Raster Fonts; smart quotes aren't. So yeah, you'd need to edit the `cmd.exe` session properties (or change the defaults for all future `cmd.exe` sessions) to make it use Consolas or Lucida Console fonts (the former actually displays the smart quotes differently from straight quotes). Even then, it's not 100%; in testing the output of paired smart quotes, it's outputting a couple garbage characters on the following line. So really, just get Python 3.6, or UNIX-like OS. :-) – ShadowRanger Dec 14 '16 at 15:10