
I am trying to open a basic text file, file.txt, which is located in the current working directory of my Python interpreter.

So I do a=open("file.txt","r")

Then I want to display its contents (it holds a single test line, something like hello world).

So I do content=a.read()

For reference, when I type a and press Enter, I get this:

a
<_io.TextIOWrapper name='fichier.txt' mode='r' encoding='UTF-8'>

Then I get an error I don't understand. Does anyone know how to fix it?

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    contenu=a.read()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 15: invalid continuation byte
  • Can you show us what exactly your file contains? This error indicates that the file contains a byte sequence that is not valid UTF-8, specifically at byte offset 15 (the sixteenth byte, since offsets start at zero). Fix that and this should run properly. – Green Cloak Guy May 21 '19 at 13:10
  • can you run `file -I fichier.txt` in the terminal and tell us the output? – Boris Verkhovskiy May 21 '19 at 13:14
  • Ok so I did a new doc with the .rtf extension. The text inside is "this file is vanilla. It only contains letters and dots.". Now python seems to read it, but doesn't display properly what's inside. Instead, I see – Sébastien Chabrol May 21 '19 at 14:45
  • '{\\rtf1\\ansi\\ansicpg1252\\cocoartf1671\\cocoasubrtf200\n{\\fonttbl\\f0\\fswiss\\fcharset0 Helvetica;}\n{\\colortbl;\\red255\\green255\\blue255;}\n{\\*\\expandedcolortbl;;}\n\\paperw11900\\paperh16840\\margl1440\\margr1440\\vieww10800\\viewh8400\\viewkind0\n\\pard\\tx566\\tx1133\\tx1700\\tx2267\\tx2834\\tx3401\\tx3968\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural\\partightenfactor0\n\n\\f0\\fs24 \\cf0 this file is vanilla. It only contains letters and dots.}' – Sébastien Chabrol May 21 '19 at 14:46
  • It's always good to try to read a regular txt file with basic characters to see if there are some issues with the content. – Adam Zaft May 21 '19 at 13:23
  • Well I tried the same with a docx file containing the same content. – Sébastien Chabrol May 21 '19 at 14:55
  • This time I have the error again saying Traceback (most recent call last): File "", line 1, in b=a.read() File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid continuation byte – Sébastien Chabrol May 21 '19 at 14:56
  • @SébastienChabrol can you run the command I said? Can you also run `xxd fichier.txt` and put the contents on pastebin or edit them into your question? – Boris Verkhovskiy May 22 '19 at 00:17
  • Opening a .rtf and a .docx as a raw text file isn't going to work. Unlike a .txt file that contains only text, those files contain a bunch of information besides the text (like where to render it, in what font) which should be read and parsed by a library, like for example [PyPDF2](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) or [textract](https://textract.readthedocs.io/en/stable/). – Boris Verkhovskiy May 22 '19 at 00:22

2 Answers


Your file is probably not encoded in UTF-8. Try:

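from chardet import detect  # third-party package: pip3 install chardet

# Open in binary mode so Python does not try to decode anything yet.
with open("file.txt", "rb") as infile:
    raw = infile.read()

# detect() returns a dict; 'encoding' is chardet's best guess and may be None.
encoding = detect(raw)['encoding']
print(encoding)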
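
If installing chardet is not an option, a rough guess is possible with the standard library alone. The following is only a sketch: `guess_encoding` is a hypothetical helper, and the candidate list (`utf-8`, `cp1252`, `mac_roman`) is an assumption about what a file created on a Mac or Windows machine might use. A successful decode does not prove the guess is right, only that the bytes are legal in that encoding.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "mac_roman")):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not legal in this encoding, try the next
        return name
    return None


# "S\u00e9bastien" is "Sébastien"; saved as cp1252 the é becomes the single byte E9.
sample = "S\u00e9bastien".encode("cp1252")
print(guess_encoding(sample))
```

Order matters here: UTF-8 is strict, so it goes first, while permissive single-byte encodings that accept almost any input go last.
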
gaFF
  • You have to `pip3 install chardet` first – Boris Verkhovskiy May 21 '19 at 13:12
  • I don't have the "chardet" package. I have really no idea how to download it and link it to my python. I'm a beginner btw – Sébastien Chabrol May 21 '19 at 13:35
  • Open up Terminal.app. You are now in a bash prompt. It's similar to the python prompt in your screenshot but it's used for controlling your computer. When you install Python it installs a bash command called `pip3` (the 3 is because this is Python 3) which is used for installing packages. To install a package you type `pip3 install `. In this case, `pip3 install chardet`. If that runs successfully, when you're back in the python prompt, you can do `import chardet` or `from chardet import detect`. – Boris Verkhovskiy May 22 '19 at 00:31

Your file is not encoded in UTF-8. The encoding is determined by the tool used to create the file, so make sure you open the file with the encoding it was saved in.

Here's an example:

>>> s = 'Sébastien Chabrol'
>>> s.encode('utf8')             # é in UTF-8 is encoded as bytes C3 A9.
b'S\xc3\xa9bastien Chabrol'
>>> s.encode('cp1252')           # é in cp1252 is encoded as byte E9.
b'S\xe9bastien Chabrol'
>>> s.encode('utf8').decode('1252')  # decoding incorrectly can produce wrong characters...
'SÃ©bastien Chabrol'
>>> s.encode('cp1252').decode('utf8') # or just fail.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte

On Python 3, you can pass the encoding when you open the file (replace utf8 with whatever encoding the file actually uses):

a = open('file.txt','r',encoding='utf8')
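
To see the effect of the `encoding` argument end to end, here is a small self-contained sketch. It assumes, purely for illustration, that the problem file was saved as cp1252; the temporary file `demo.txt` is a stand-in and not the asker's actual file.

```python
import os
import tempfile

# Create a small cp1252-encoded file to stand in for the problem file.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="cp1252") as f:
    f.write("S\u00e9bastien Chabrol")  # \u00e9 is é

# Reading it back as UTF-8 fails, just like in the question.
try:
    with open(path, "r", encoding="utf8") as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading it with the encoding it was written in succeeds.
with open(path, "r", encoding="cp1252") as f:
    text = f.read()

print(utf8_ok, text)
```
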

On Python 2 or 3, you can also use the backward-compatible syntax:

import io
a = io.open('file.txt','r',encoding='utf8')

If you have no idea of the encoding, you can open in binary mode to see the raw byte content and at least make a guess:

# Binary mode returns bytes and never raises a decoding error.
with open('file.txt', 'rb') as a:
    print(a.read())
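
When the file is long, it can help to look only at the bytes around the failure. A `UnicodeDecodeError` carries the offending offset in its `start` and `end` attributes and the original bytes in `object`, so a sketch like the one below can print a short window around the problem. The `raw` value is made-up sample data standing in for the bytes read from the file.

```python
raw = b"hello w\xe9rld"  # stand-in for the bytes read from the file

try:
    raw.decode("utf8")
except UnicodeDecodeError as err:
    bad_offset = err.start  # zero-based offset of the first undecodable byte
    # Show a few bytes on either side of the failure for context.
    context = err.object[max(0, err.start - 5):err.end + 5]
    print(bad_offset, err.reason, context)
```

Comparing the byte values in that window against a code page table (for example cp1252 or Mac Roman) is often enough to recognise the real encoding.
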

Read more about Python and encodings here: https://nedbatchelder.com/text/unipain.html

Mark Tolonen