365

Here is my code:

for line in open('u.item'):
    ...  # Read each line

Whenever I run this code it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to solve this by adding an extra parameter to open(). The code looks like this:

for line in open('u.item', encoding='utf-8'):
    ...  # Read each line

But again it gives the same error. What should I do then?

SujitS (edited by Peter Mortensen)

20 Answers

649

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open('u.item', encoding="utf-8") with open('u.item', encoding="ISO-8859-1") solves the problem.
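Spelled out, that is (a minimal sketch of the fix described above):

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)  # Read each line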

SujitS (edited by Peter Mortensen)
  • Explicit is better than implicit (PEP 20). – 0 _ Jul 01 '16 at 05:46
  • The trick is that ISO-8859-1 (Latin-1) is an 8-bit character set, so every byte maps to a valid value. The result may not be usable, but it works if you want to ignore errors! – Kjeld Flarup Apr 12 '18 at 08:53
  • I had the same issue: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 32: invalid continuation byte. I used Python 3.6.5 to install the AWS CLI, and when I tried aws --version it failed with this error. So I had to edit /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/configparser.py and change the code to the following: **def read(self, filenames, encoding="ISO-8859-1"):** – Евгений Коптюбенко Sep 27 '18 at 14:18
  • Is there an automatic way of detecting encoding? – JoseOrtiz3 Jan 29 '19 at 23:20
  • @OrangeSherbet I implemented detection using `chardet` (a fuller sketch follows after these comments). Here's the one-liner (after `import chardet`): `chardet.detect(open(in_file, 'rb').read())['encoding']`. Check out this answer for details: https://stackoverflow.com/a/3323810/615422 – VertigoRay Mar 20 '19 at 13:34
  • How do you get the encoding of a file? – WJA Aug 05 '19 at 15:22
  • Note that `'ISO-8859-1'` will *always* work even if it's not the right encoding, because each of the 256 byte values maps to a Unicode character. I believe it's the only encoding which does this. – Mark Ransom May 04 '20 at 17:03
  • I like @VertigoRay's suggestion of `chardet` in a script, but for something really quick to diagnose what's going on, a simple `file` helped me: `% file list.log` gives `list.log: ISO-8859 text`, vs. `% file playlist.txt` gives `playlist.txt: UTF-8 Unicode text, with CRLF, LF line terminators` – Billy Oct 12 '20 at 18:49
  • @VertigoRay's answer should be the accepted one IMHO -- answers without encoding detection cannot reliably solve the question – Fred Zimmerman Sep 10 '21 at 05:02
  • @OrangeSherbet there's no sure way unless you can find out from whoever produced the file. But it's possible to guess based on the file contents, and some guessing methods are better than others. By coincidence I chanced on a new way to do it in Python the other day: [charset-normalizer](https://pypi.org/project/charset-normalizer/). – Mark Ransom Dec 05 '21 at 17:26
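As promised above, a minimal sketch of the `chardet` approach from these comments (an illustration only; chardet must be installed first, e.g. with pip install chardet):

import chardet

# Detect the encoding from the raw bytes, then reopen the file with it
with open('u.item', 'rb') as f:
    result = chardet.detect(f.read())

print(result)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

with open('u.item', encoding=result['encoding']) as f:
    for line in f:
        print(line)  # Read each line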
89

The following also worked for me. ISO-8859-1 is going to save you a lot of trouble, mainly if you are using speech recognition APIs.

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
Ryoji Kuwae Neto (edited by mkrieger1)
  • You may be correct that the OP is reading ISO 8859-1, as can be deduced from the 0xe9 (é) in the error message, but you should explain why your solution works. The reference to speech recognition APIs does not help. – RolfBly Oct 26 '17 at 20:26
42

Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.

In Windows-1252 encoding, for example, the 0xe9 would be the character é.
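You can verify this quickly in a Python shell (a small check, not part of the original answer):

>>> b'\xe9'.decode('windows-1252')
'é'
>>> b'\xe9'.decode('latin-1')
'é'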

Mark Ransom (edited by Peter Mortensen)
  • So, how can I find out what encoding it is? I am using Linux. – SujitS Oct 31 '13 at 11:35
  • There is no way to do that that always works, but see the answers to this question: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – RemcoGerlich Oct 31 '13 at 12:37
30

Try reading the file with Pandas:

import pandas as pd

# m_cols is assumed to be a list of column names defined earlier
pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
Shashank (edited by Peter Mortensen)
  • Not sure why you're suggesting Pandas. The solution is setting the correct encoding, which you've chanced upon here. – Alastair McCormack Jan 07 '20 at 10:34
  • 'latin-1' is the same as 'ISO-8859-1'? – Peter Mortensen Jan 30 '21 at 16:37
  • @PeterMortensen yes it is, [Wikipedia confirms it](https://en.wikipedia.org/wiki/ISO/IEC_8859-1). They both produce the same output when used with `decode` in Python as well. – Mark Ransom Mar 09 '21 at 15:49
  • @AlastairMcCormack one more late comment: `'latin-1'` will always read a file without error, because there are no invalid bytes in that encoding, even if it produces the wrong characters. It is the only encoding in Python with that property. – Mark Ransom Jul 21 '22 at 21:27
  • @MarkRansom I'm not sure about that :) What's an invalid byte in any 8-bit code page? Surely, all the ISO-8859 code pages will accept any byte? – Alastair McCormack Jul 22 '22 at 13:01
  • @AlastairMcCormack you're right, many of the ISO-8859 variants work with every byte value, even though they generate different characters from those bytes. But not all of them: iso8859_3 fails with 0xa5, iso8859_6 fails with 0xa1, iso8859_7 fails with 0xae, iso8859_8 fails with 0xa1, and iso8859_11 fails with 0xdb. – Mark Ransom Jul 22 '22 at 23:42
  • @MarkRansom I admire your dedication and tenacity to science! Thank you. Have a great weekend. – Alastair McCormack Jul 23 '22 at 07:15
  • @AlastairMcCormack I'd hardly call it dedication and tenacity, you just made me curious. I'd always assumed that iso8859_1 was the only one that could decode every byte value, so I tried them all; it turns out a lot of them can do it: iso8859_2, iso8859_4, iso8859_5, iso8859_9, iso8859_10, iso8859_13, iso8859_14, iso8859_15, and iso8859_16. Latin-1 aka iso8859_1 is still unique in having a direct 1:1 correspondence between byte values and Unicode code points. – Mark Ransom Jul 23 '22 at 14:57
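That survey of the ISO-8859 variants from the comments above can be reproduced with a short check (a sketch; the codec names follow Python's codecs module, and there is no iso8859_12):

# Try every possible byte value against each ISO-8859 codec
all_bytes = bytes(range(256))
for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]:
    name = 'iso8859_{}'.format(n)
    try:
        all_bytes.decode(name)
        print(name, 'decodes all 256 byte values')
    except UnicodeDecodeError as e:
        print(name, 'fails at byte', hex(all_bytes[e.start]))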
23

This works:

open('filename', encoding='latin-1')

Or:

open('filename', encoding="ISO-8859-1")
Ayesha Siddiqa (edited by Peter Mortensen)
  • Depends on what you mean by "works". If you mean avoids exceptions, that's true, because it's the only encoding that doesn't have invalid bytes or sequences. It doesn't mean you'll get the proper characters, though. – Mark Ransom Mar 09 '21 at 15:51
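To see that caveat concretely (an illustration, not part of the original answer): decoding UTF-8 bytes as Latin-1 never raises, but it silently produces mojibake:

raw = 'é'.encode('utf-8')      # b'\xc3\xa9'
print(raw.decode('latin-1'))   # Ã© (no exception, but wrong characters)
print(raw.decode('utf-8'))     # é (the intended text)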
17

If you are using Python 2, the following is the solution:

import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    pass  # Do something

Because Python 2's built-in open() doesn't accept an encoding parameter, using it there gives the following error:

TypeError: 'encoding' is an invalid keyword argument for this function

Jeril (edited by Peter Mortensen)
16

You could resolve the problem with:

for line in open(your_file_path, 'rb'):

'rb' opens the file in binary mode, so each line comes back as a bytes object and no decoding is attempted.
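Reading in binary mode means you must decode the bytes yourself. A hedged sketch (the latin-1 fallback is an assumption; adjust it to your data):

with open('u.item', 'rb') as f:
    raw = f.read()

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError:
    text = raw.decode('latin-1')  # Assumed fallback; latin-1 never raises

for line in text.splitlines():
    print(line)  # Process each decoded line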

Ozcar Nguyen (edited by Peter Mortensen)
9

You can try this way:

open('u.item', encoding='utf8', errors='ignore')
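For context on the consequences (an illustration, not part of the original answer): errors='ignore' silently drops undecodable bytes, while errors='replace' substitutes U+FFFD so the data loss stays visible:

raw = 'café'.encode('latin-1')                # b'caf\xe9' is invalid UTF-8
print(raw.decode('utf-8', errors='ignore'))   # caf (the é is silently gone)
print(raw.decode('utf-8', errors='replace'))  # caf� (the loss stays visible)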
Farid Chowdhury (edited by Peter Mortensen)
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - [From Review](/review/low-quality-posts/26211981) – MartenCatcher May 24 '20 at 06:04
  • @MartenCatcher yeah, but it helps future visitors to the question. Although more explanation would make it much better, I believe it serves a better purpose as an answer than as a comment. – Silidrone Nov 28 '20 at 18:14
  • What is the intent? Ignoring errors? What are the consequences? – Peter Mortensen Jan 30 '21 at 16:51
  • It's useful if you don't know the source encoding and you just want to get a "close enough" string; for example, output of PowerShell commands in different OS locales, where you just want to search for a specific language-independent string and don't actually care what the rest of the text says. – yossiz74 Aug 16 '22 at 12:04
9

I was using a dataset downloaded from Kaggle. While reading this dataset, it threw this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte

This is how I fixed it:

import pandas as pd

pd.read_csv('top50.csv', encoding='ISO-8859-1')

Vineet Singh
6

Based on another question on Stack Overflow and previous answers in this post, I would like to add some help to find the right encoding.

If your script runs on a Linux OS, you can get the encoding with the file command:

file --mime-encoding <filename>

Here is a Python script that does that for you:

import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using the file command."""

    # Run file to get the MIME encoding; subprocess searches PATH,
    # so there is no need to locate the command with 'which' first
    try:
        file_run = subprocess.run(['file', '--mime-encoding', fname],
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE)
    except FileNotFoundError:
        print("Unable to find the 'file' command", file=sys.stderr)
        return None

    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None

    # The output looks like "u.item: iso-8859-1"; return the encoding
    # name only, splitting on the last colon in case the filename
    # itself contains one
    return file_run.stdout.decode().rsplit(':', 1)[1].strip()

# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
Alain Cherpin
  • I was looking for an answer, and interestingly you answered 7 hours ago a question asked 8 years ago. Interesting coincidence. – Pooya Estakhri Aug 30 '21 at 12:46
  • I don't get it, why would you use a 33-line program to avoid typing one line in the shell? – Mark Ransom Oct 17 '21 at 03:10
  • Also there are ways to do it within Python itself, without relying on an external utility. See for example https://stackoverflow.com/q/436220/5987 – Mark Ransom Jul 23 '22 at 14:18
5

This is an example of converting a CSV file in Python 3:

import csv
from sys import argv

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'),
                             delimiter=',', quotechar='"')
except IOError:
    pass
user6832484 (edited by Peter Mortensen)
4

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte

The above error occurs because of the file's encoding.

Solution: use encoding='latin-1'.

Reference: https://pandas.pydata.org/docs/search.html?q=encoding

Kalluri
3

Replace the encoding with encoding='ISO-8859-1':

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)

2

Sometimes when calling open(filepath), where filepath actually is not a file, you can get the same error, so first make sure the file you're trying to open exists:

import os
assert os.path.isfile(filepath)
xtluo (edited by Peter Mortensen)
  • How would opening a file that doesn't exist generate a `UnicodeDecodeError`? And in Python it's customary to use [the EAFP principle](https://stackoverflow.com/q/11360858/5987) over the LBYL that you're endorsing here. – Mark Ransom Oct 17 '21 at 03:18
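Following the EAFP suggestion in the comment above, a hedged sketch (the ISO-8859-1 encoding is carried over from the other answers):

try:
    with open(filepath, encoding='ISO-8859-1') as f:
        for line in f:
            print(line)  # Process each line
except FileNotFoundError:
    print('No such file:', filepath)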
2

Open your file with Notepad++ and use the "Encoding" (or "Encodage") menu to identify the current encoding, or to convert it from ANSI to UTF-8 or the ISO 8859-1 code page.

JGaber (edited by Peter Mortensen)
1

So that this page turns up faster in Google searches for a similar question (an error with UTF-8), I am leaving my solution here for others.

I had a problem opening a .csv file, with this description:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte

I opened the file with Notepad and counted to the 150th position: it was a Cyrillic symbol. I re-saved that file with the 'Save as...' command, choosing Encoding 'UTF-8', and my program started to work.
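The same re-save can be done in Python (a hedged sketch; cp1251 is only a guess at the file's original Cyrillic encoding, so adjust it if detection says otherwise):

# Re-save the file as UTF-8, assuming the source encoding was cp1251
with open('data.csv', encoding='cp1251') as src:
    text = src.read()
with open('data-utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)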

Nikita Axenov (edited by Eric Aya)
  • Please note that questions and answers on SO must be in English only, even if the problem you encountered may bite mainly programmers using the Cyrillic alphabet. – Thierry Lathuille Aug 03 '21 at 10:01
  • @ThierryLathuille, is it a real problem? Could you please give me a link/reference to the community rule on that issue? – Nikita Axenov Aug 03 '21 at 10:13
  • This is considered a real problem, and is probably what caused your answer to get downvoted. Non-English content is not allowed on SO (see for example https://meta.stackoverflow.com/questions/297673/how-do-i-deal-with-non-english-content ), and the rule is really strictly respected. For questions in Russian, you have https://ru.stackoverflow.com/ , though ;) – Thierry Lathuille Aug 03 '21 at 10:20
  • @ThierryLathuille This applies to the English content, not problems with non-English symbols. And this doesn't necessarily have to be about other languages; it could be a different UTF-8 character (for example, a checkmark). – Anonymous Aug 03 '21 at 19:52
1

I keep coming across this error, and often the solution is not encoding='utf-8' but in fact engine='python', like this:

import pandas as pd

file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df

A link to the docs is here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

D.L
0

Use this if you are directly loading data from GitHub or Kaggle:

DF = pd.read_csv(file, encoding='ISO-8859-1')

0

In my case, this issue occurred because I had changed the extension of an Excel file (.xlsx) directly to .csv...

The solution was to open the file and save it as a new .csv file (i.e., File -> Save As -> select the .csv extension and save it). This worked for me.
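Alternatively (a hedged sketch, not from the original answer), read the Excel file directly with pandas and write a real CSV from it; pd.read_excel needs the openpyxl package for .xlsx files:

import pandas as pd

# Read the .xlsx file as-is instead of renaming it
df = pd.read_excel('data.xlsx')         # 'data.xlsx' is a placeholder name
df.to_csv('data.csv', index=False)      # Write a genuine CSV if one is needed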

afrah
0

My issue was similar in that UTF-8 text was getting passed to the Python script.

In my case, it was from SQL using the sp_execute_external_script in the Machine Learning service for SQL Server. For whatever reason, VARCHAR data appears to get passed as UTF-8, whereas NVARCHAR data gets passed as UTF-16.

Since there's no way to specify the default encoding in Python, and there was no user-editable Python statement parsing the data, I had to use the SQL CONVERT() function in my SELECT query in the @input_data_1 parameter.

So, while this query

EXEC sp_execute_external_script @language = N'Python', 
@script = N'
OutputDataSet = InputDataSet
', 
@input_data_1 = N'SELECT id, text FROM the_error;'
WITH RESULT SETS (([id] int, [text] nvarchar(max)));

gives the error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: unexpected end of data

Using CONVERT(type, data) instead (CAST(data AS type) would also work):

EXEC sp_execute_external_script @language = N'Python', 
@script = N'
OutputDataSet = InputDataSet
', 
@input_data_1 = N'SELECT id, CONVERT(NVARCHAR(max), text) FROM the_error;'
WITH RESULT SETS (([id] INT, [text] NVARCHAR(max)));

returns

id  text
1   Ç
Mark Smith