I have some text encoded in UTF-8: 'Before – after.' It was fetched from the web, and the '–' character is the issue. If I assign it directly at the command line, using copy and paste, it prints correctly:

>>> text = 'Before – after.'
>>> print text
Before – after.

But if I save it to a text file and print it from there:

>>> for line in open('file.txt','r'):
...     print line
Before û after.

I'm pretty sure this is some sort of UTF-8 encode/decode error, but it is eluding me. I have tried to decode and to re-encode, but neither works.

>>> for line in open('file.txt','r'):
...     print line.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 7: invalid start byte

>>> for line in open('file.txt','r'):
...     print line.encode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 7: invalid start byte
Dbricks
  • did you try decoding it and then printing it out? What exactly are you trying to do? – RPT Apr 30 '17 at 15:22
  • just trying to fetch text from the web, save to txt document for processing later. When printing, it gets corrupted. – Dbricks Apr 30 '17 at 15:24
  • did you try decoding it? `line = line.decode('utf-8');` – RPT Apr 30 '17 at 15:25
  • Yes:`line=line.decode('utf-8')`; `UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 7: invalid start byte` – Dbricks Apr 30 '17 at 15:27
  • How did you save it to a text file? How do you know you saved it as utf-8? I saved your string as utf-8 and it worked for me. Your method of decoding each line should work. – tdelaney Apr 30 '17 at 15:51
  • The reason why it works on the command line is a bit odd and a quirk of how unicode works on Python 2. That "string" is really an encoded byte sequence. It looks right because it's in the same encoding as your terminal. What is `sys.stdout.encoding` on your system? – tdelaney Apr 30 '17 at 15:53
  • As an aside, python 3 has been out for many years and has much more robust unicode support. Python 2 is legacy and should only be used when required. – tdelaney Apr 30 '17 at 15:55

1 Answer

It's happening because a non-ASCII character cannot be encoded or decoded. You can strip it out and then print the ASCII values. Take a look at this question: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
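
As a rough illustration of what the error message points at, here is a minimal sketch. It assumes the file was actually written as Windows-1252 rather than UTF-8; the question does not say how it was saved, so this is a guess. Byte 0x96 cannot start a UTF-8 sequence, but it is the en dash in cp1252 and 'û' in cp437 (a common Windows console code page), which would match the output in the question. The variable names are illustrative only.

```python
# Illustrative bytes: the line as it might sit on disk if it was saved
# as Windows-1252 (an assumption, not confirmed by the question).
raw = b'Before \x96 after.'

# 0x96 is not a valid UTF-8 start byte, so decoding as UTF-8 fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same byte under two single-byte code pages.
as_cp1252 = raw.decode('cp1252')  # en dash, U+2013
as_cp437 = raw.decode('cp437')    # u with circumflex, U+00FB

# The "strip it out" approach: silently drop every non-ASCII byte.
stripped = raw.decode('ascii', 'ignore')
```

Stripping works but loses the dash entirely; decoding with the codec the file was really written in keeps it.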

RPT
  • So why would it work with a direct cut-and-paste approach? `>>> text = 'Before - after.'` `>>> print text` `Before - after.` `>>> text.encode('utf-8')` `'Before - after.'` – Dbricks Apr 30 '17 at 15:36
  • Non-ASCII text can be decoded. In Python 2, the read returns a byte string, which should then be decoded. The error suggests that the OP guessed the wrong encoding. – tdelaney Apr 30 '17 at 15:54
  • Read the first example at https://docs.python.org/3/howto/unicode.html; it should be of use. – RPT Apr 30 '17 at 16:01
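
Following the comment that the wrong encoding was probably guessed, the sketch below shows one way to read the file with an explicit codec and re-save it as UTF-8 for later processing. The `cp1252` encoding and the temporary `file.txt` are assumptions made for illustration; substitute whatever encoding the file was really written in. `io.open` is used because it accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the question's file, saved as Windows-1252.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')
with open(path, 'wb') as f:
    f.write(b'Before \x96 after.\n')

# io.open decodes while reading; 'cp1252' is an assumed encoding.
with io.open(path, 'r', encoding='cp1252') as f:
    lines = [line.rstrip(u'\n') for line in f]

# Re-save as UTF-8 so later processing can rely on a single encoding.
with io.open(path, 'w', encoding='utf-8', newline='\n') as f:
    f.write(u'\n'.join(lines) + u'\n')

# Read the raw bytes back to confirm what is now on disk.
with open(path, 'rb') as f:
    utf8_bytes = f.read()
```

After this, `line.decode('utf-8')` from the question should succeed on the re-saved file, since the en dash is now stored as the three UTF-8 bytes `\xe2\x80\x93`.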