Reading and writing UTF-8 from file

Question

I have some text that I believe is encoded in UTF-8: 'Before – after.' It was fetched from the web, and the '–' character is the issue. If I copy and paste it directly into the interactive prompt and print it, it works:

>>> text = 'Before – after.'
>>> print text
Before – after.

But if I save it to a text file and print it from there, the character is corrupted:

>>> for line in open('file.txt', 'r'):
...     print line
Before û after.

I'm pretty sure this is some sort of UTF-8 encode/decode error, but it is eluding me. I have tried decoding and re-encoding, but neither works.

>>> for line in open('file.txt', 'r'):
...     print line.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 7: invalid start byte

>>> for line in open('file.txt', 'r'):
...     print line.encode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 7: invalid start byte

Did you try decoding it and then printing it out? What exactly are you trying to do? – RPT Apr 30 '17 at 15:22
I'm just trying to fetch text from the web and save it to a text document for processing later. When printing, it gets corrupted. – Dbricks Apr 30 '17 at 15:24
Yes: `line = line.decode('utf-8')` gives `UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 7: invalid start byte` – Dbricks Apr 30 '17 at 15:27
How did you save it to a text file? How do you know you saved it as UTF-8? I saved your string as UTF-8 and it worked for me. Your method of decoding each line should work. – tdelaney Apr 30 '17 at 15:51
The reason it works on the command line is a bit odd and a quirk of how Unicode works in Python 2. That "string" is really an encoded byte sequence. It looks right because it's in the same encoding as your terminal. What is `sys.stdout.encoding` on your system? – tdelaney Apr 30 '17 at 15:53
As an aside, Python 3 has been out for many years and has much more robust Unicode support. Python 2 is legacy and should only be used when required. – tdelaney Apr 30 '17 at 15:55
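
The byte named in the traceback, 0x96, hints at what the comments above suspect: the file is probably not UTF-8. A minimal sketch of what that byte means in a few codecs follows. The choice of cp1252 (a common Windows file encoding) and cp437 (a common Windows console encoding) is an assumption; nothing in the thread confirms either.

```python
# -*- coding: utf-8 -*-
# Sketch: how the en dash and the byte 0x96 relate in different codecs.
# Assumption (not confirmed in the thread): the file was saved as cp1252
# and the console displays bytes as cp437.

EN_DASH = u'\u2013'

# In UTF-8 the en dash is three bytes, none of which is 0x96.
utf8_bytes = EN_DASH.encode('utf-8')       # b'\xe2\x80\x93'

# In cp1252 the en dash is the single byte 0x96.
cp1252_bytes = EN_DASH.encode('cp1252')    # b'\x96'

# The same byte interpreted as cp437 is U+00FB (u with circumflex),
# which matches the corrupted character in the question's output.
as_cp437 = cp1252_bytes.decode('cp437')

# On its own, 0x96 is a UTF-8 continuation byte, so it cannot start a
# character; decoding it as UTF-8 fails.
try:
    cp1252_bytes.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

print(repr(utf8_bytes))
print(repr(cp1252_bytes))
print(as_cp437 == u'\xfb')
print(utf8_error)
```

If the real file does contain 0x96 where the dash should be, decoding each line with `'cp1252'` instead of `'utf-8'` would be the thing to try first.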

score 0 · Answer 1 · edited May 23 '17 at 12:34

It's happening because a non-ASCII character cannot be encoded or decoded. You can strip it out and then print the ASCII values. Take a look at this question: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

answered Apr 30 '17 at 15:29

RPT


So why would it work with a direct cut and paste approach? `>>> text = 'Before - after.'` then `>>> print text` prints `Before - after.`, and `>>> text.encode('utf-8')` returns `'Before - after.'` – Dbricks Apr 30 '17 at 15:36
Non-ASCII text can be decoded. In Python 2, the read returns a byte string, which should then be decoded. The error suggests that the OP guessed the wrong encoding. – tdelaney Apr 30 '17 at 15:54
Read the first example at https://docs.python.org/3/howto/unicode.html; it should be of use. – RPT Apr 30 '17 at 16:01
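
Following tdelaney's point that the read returns bytes which must be decoded with the right codec, here is a hedged sketch of the round trip. `io.open` accepts an `encoding` argument on both Python 2 and Python 3. The file name `demo.txt`, the helpers `write_text` and `read_text`, and the guess of cp1252 as the original encoding are all illustrative assumptions, not something the thread establishes.

```python
# -*- coding: utf-8 -*-
# Sketch: decode a file with an explicit codec, then re-save it as UTF-8.
# Assumptions: 'demo.txt', write_text/read_text, and cp1252 are hypothetical.
import io
import os
import shutil
import tempfile

text = u'Before \u2013 after.'


def write_text(path, content, encoding):
    """Write a unicode string to path using the given encoding."""
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(content)


def read_text(path, encoding):
    """Read path and decode its bytes using the given encoding."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')

# Simulate the situation in the question: the file is saved as cp1252.
write_text(path, text, 'cp1252')
with io.open(path, 'rb') as handle:
    raw = handle.read()            # contains the single byte 0x96

# Decoding with the codec the file was written in recovers the text.
recovered = read_text(path, 'cp1252')

# Re-save as UTF-8 so later reads can decode with 'utf-8'.
write_text(path, recovered, 'utf-8')
round_trip = read_text(path, 'utf-8')

shutil.rmtree(tmp_dir)

print(b'\x96' in raw)
print(recovered == text)
print(round_trip == text)
```

Passing the encoding at open time keeps the decode in one place, rather than calling `.decode()` on every line.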
