365

Here is my code:

for line in open('u.item'):
    ...  # Read each line

Whenever I run this code it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to solve this by adding an extra parameter to open(). The code looks like this:

for line in open('u.item', encoding='utf-8'):
    ...  # Read each line

But again it gives the same error. What should I do then?

SujitS (edited by Peter Mortensen)

20 Answers

649

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open('u.item', encoding="utf-8") with open('u.item', encoding="ISO-8859-1") solves the problem.
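Spelled out, that is (a minimal sketch of the fix described above):

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)  # Read each line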

SujitS (edited by Peter Mortensen)
  • Explicit is better than implicit (PEP 20). – 0 _ Jul 01 '16 at 05:46
  • The trick is that ISO-8859-1 (Latin-1) is an 8-bit character set, so every byte maps to a valid value. The result may not be usable, but it works if you want to ignore errors! – Kjeld Flarup Apr 12 '18 at 08:53
  • I had the same issue: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 32: invalid continuation byte. I used Python 3.6.5 to install the AWS CLI, and when I tried aws --version it failed with this error. So I had to edit /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/configparser.py and change the code to the following: **def read(self, filenames, encoding="ISO-8859-1"):** – Евгений Коптюбенко Sep 27 '18 at 14:18
  • Is there an automatic way of detecting encoding? – JoseOrtiz3 Jan 29 '19 at 23:20
  • @OrangeSherbet I implemented detection using `chardet` (a fuller sketch follows after these comments). Here's the one-liner (after `import chardet`): `chardet.detect(open(in_file, 'rb').read())['encoding']`. Check out this answer for details: https://stackoverflow.com/a/3323810/615422 – VertigoRay Mar 20 '19 at 13:34
  • How do you get the encoding of a file? – WJA Aug 05 '19 at 15:22
  • Note that `'ISO-8859-1'` will *always* work even if it's not the right encoding, because each of the 256 byte values maps to a Unicode character. I believe it's the only encoding which does this. – Mark Ransom May 04 '20 at 17:03
  • I like @VertigoRay's suggestion of `chardet` in a script, but for something really quick to diagnose what's going on, a simple `file` helped me: `% file list.log` gives `list.log: ISO-8859 text`, vs. `% file playlist.txt` gives `playlist.txt: UTF-8 Unicode text, with CRLF, LF line terminators` – Billy Oct 12 '20 at 18:49
  • @VertigoRay's answer should be the accepted one IMHO -- answers without encoding detection cannot reliably solve the question – Fred Zimmerman Sep 10 '21 at 05:02
  • @OrangeSherbet there's no sure way unless you can find out from whoever produced the file. But it's possible to guess based on the file contents, and some guessing methods are better than others. By coincidence I chanced on a new way to do it in Python the other day: [charset-normalizer](https://pypi.org/project/charset-normalizer/). – Mark Ransom Dec 05 '21 at 17:26
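As promised above, a minimal sketch of the `chardet` approach from these comments (an illustration only; chardet must be installed first, e.g. with pip install chardet):

import chardet

# Detect the encoding from the raw bytes, then reopen the file with it
with open('u.item', 'rb') as f:
    result = chardet.detect(f.read())

print(result)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

with open('u.item', encoding=result['encoding']) as f:
    for line in f:
        print(line)  # Read each line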
89

The following also worked for me. ISO-8859-1 is going to save you a lot of trouble, mainly if you are using speech recognition APIs.

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
Ryoji Kuwae Neto (edited by mkrieger1)
  • You may be correct that the OP is reading ISO 8859-1, as can be deduced from the 0xe9 (é) in the error message, but you should explain why your solution works. The reference to speech recognition APIs does not help. – RolfBly Oct 26 '17 at 20:26
42

Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.

In Windows-1252 encoding, for example, the 0xe9 would be the character é.
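You can verify this quickly in a Python shell (a small check, not part of the original answer):

>>> b'\xe9'.decode('windows-1252')
'é'
>>> b'\xe9'.decode('latin-1')
'é'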

Mark Ransom (edited by Peter Mortensen)
  • So, how can I find out what encoding it is? I am using Linux. – SujitS Oct 31 '13 at 11:35
  • There is no way to do that that always works, but see the answers to this question: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – RemcoGerlich Oct 31 '13 at 12:37
30

Try reading the file with Pandas:

import pandas as pd

# m_cols is assumed to be a list of column names defined earlier
pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
Shashank (edited by Peter Mortensen)
  • Not sure why you're suggesting Pandas. The solution is setting the correct encoding, which you've chanced upon here. – Alastair McCormack Jan 07 '20 at 10:34
  • 'latin-1' is the same as 'ISO-8859-1'? – Peter Mortensen Jan 30 '21 at 16:37
  • @PeterMortensen yes it is, [Wikipedia confirms it](https://en.wikipedia.org/wiki/ISO/IEC_8859-1). They both produce the same output when used with `decode` in Python as well. – Mark Ransom Mar 09 '21 at 15:49
  • @AlastairMcCormack one more late comment: `'latin-1'` will always read a file without error, because there are no invalid bytes in that encoding, even if it produces the wrong characters. It is the only encoding in Python with that property. – Mark Ransom Jul 21 '22 at 21:27
  • @MarkRansom I'm not sure about that :) What's an invalid byte in any 8-bit code page? Surely, all the ISO-8859 code pages will accept any byte? – Alastair McCormack Jul 22 '22 at 13:01
  • @AlastairMcCormack you're right, many of the ISO-8859 variants work with every byte value, even though they generate different characters from those bytes. But not all of them: iso8859_3 fails with 0xa5, iso8859_6 fails with 0xa1, iso8859_7 fails with 0xae, iso8859_8 fails with 0xa1, and iso8859_11 fails with 0xdb. – Mark Ransom Jul 22 '22 at 23:42
  • @MarkRansom I admire your dedication and tenacity to science! Thank you. Have a great weekend. – Alastair McCormack Jul 23 '22 at 07:15
  • @AlastairMcCormack I'd hardly call it dedication and tenacity, you just made me curious. I'd always assumed that iso8859_1 was the only one that could decode every byte value, so I tried them all; it turns out a lot of them can do it: iso8859_2, iso8859_4, iso8859_5, iso8859_9, iso8859_10, iso8859_13, iso8859_14, iso8859_15, and iso8859_16. Latin-1 aka iso8859_1 is still unique in having a direct 1:1 correspondence between byte values and Unicode code points. – Mark Ransom Jul 23 '22 at 14:57
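That survey of the ISO-8859 variants from the comments above can be reproduced with a short check (a sketch; the codec names follow Python's codecs module, and there is no iso8859_12):

# Try every possible byte value against each ISO-8859 codec
all_bytes = bytes(range(256))
for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]:
    name = 'iso8859_{}'.format(n)
    try:
        all_bytes.decode(name)
        print(name, 'decodes all 256 byte values')
    except UnicodeDecodeError as e:
        print(name, 'fails at byte', hex(all_bytes[e.start]))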
23

This works:

open('filename', encoding='latin-1')

Or:

open('filename', encoding="ISO-8859-1")
Ayesha Siddiqa (edited by Peter Mortensen)
  • Depends on what you mean by "works". If you mean avoids exceptions, that's true, because it's the only encoding that doesn't have invalid bytes or sequences. It doesn't mean you'll get the proper characters, though. – Mark Ransom Mar 09 '21 at 15:51
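To see that caveat concretely (an illustration, not part of the original answer): decoding UTF-8 bytes as Latin-1 never raises, but it silently produces mojibake:

raw = 'é'.encode('utf-8')      # b'\xc3\xa9'
print(raw.decode('latin-1'))   # Ã© (no exception, but wrong characters)
print(raw.decode('utf-8'))     # é (the intended text)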
17

If you are using Python 2, the following is the solution:

import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    pass  # Do something

Because Python 2's built-in open() doesn't accept an encoding parameter, using it there gives the following error:

TypeError: 'encoding' is an invalid keyword argument for this function

Jeril (edited by Peter Mortensen)
16

You could resolve the problem with:

for line in open(your_file_path, 'rb'):

'rb' opens the file in binary mode, so each line comes back as a bytes object and no decoding is attempted.
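Reading in binary mode means you must decode the bytes yourself. A hedged sketch (the latin-1 fallback is an assumption; adjust it to your data):

with open('u.item', 'rb') as f:
    raw = f.read()

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError:
    text = raw.decode('latin-1')  # Assumed fallback; latin-1 never raises

for line in text.splitlines():
    print(line)  # Process each decoded line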

Ozcar Nguyen (edited by Peter Mortensen)
9

You can try this way:

open('u.item', encoding='utf8', errors='ignore')
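For context on the consequences (an illustration, not part of the original answer): errors='ignore' silently drops undecodable bytes, while errors='replace' substitutes U+FFFD so the data loss stays visible:

raw = 'café'.encode('latin-1')                # b'caf\xe9' is invalid UTF-8
print(raw.decode('utf-8', errors='ignore'))   # caf (the é is silently gone)
print(raw.decode('utf-8', errors='replace'))  # caf� (the loss stays visible)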
Farid Chowdhury (edited by Peter Mortensen)
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - [From Review](/review/low-quality-posts/26211981) – MartenCatcher May 24 '20 at 06:04
  • @MartenCatcher yeah, but it helps future visitors to the question. Although more explanation would make it much better, I believe it serves a better purpose as an answer than as a comment. – Silidrone Nov 28 '20 at 18:14
  • What is the intent? Ignoring errors? What are the consequences? – Peter Mortensen Jan 30 '21 at 16:51
  • It's useful if you don't know the source encoding and you just want to get a "close enough" string; for example, output of PowerShell commands in different OS locales, where you just want to search for a specific language-independent string and don't actually care what the rest of the text says. – yossiz74 Aug 16 '22 at 12:04
9

I was using a dataset downloaded from Kaggle. While reading this dataset, it threw this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte

This is how I fixed it:

import pandas as pd

pd.read_csv('top50.csv', encoding='ISO-8859-1')

Vineet Singh
6

Based on another question on Stack Overflow and previous answers in this post, I would like to add some help to find the right encoding.

If your script runs on a Linux OS, you can get the encoding with the file command:

file --mime-encoding <filename>

Here is a Python script that does that for you:

import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using the file command."""

    # Run file to get the MIME encoding; subprocess searches PATH,
    # so there is no need to locate the command with 'which' first
    try:
        file_run = subprocess.run(['file', '--mime-encoding', fname],
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE)
    except FileNotFoundError:
        print("Unable to find the 'file' command", file=sys.stderr)
        return None

    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None

    # The output looks like "u.item: iso-8859-1"; return the encoding
    # name only, splitting on the last colon in case the filename
    # itself contains one
    return file_run.stdout.decode().rsplit(':', 1)[1].strip()

# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
Alain Cherpin
  • I was looking for an answer, and interestingly you answered 7 hours ago a question asked 8 years ago. Interesting coincidence. – Pooya Estakhri Aug 30 '21 at 12:46
  • I don't get it, why would you use a 33-line program to avoid typing one line in the shell? – Mark Ransom Oct 17 '21 at 03:10
  • Also there are ways to do it within Python itself, without relying on an external utility. See for example https://stackoverflow.com/q/436220/5987 – Mark Ransom Jul 23 '22 at 14:18
5

This is an example of converting a CSV file in Python 3:

import csv
from sys import argv

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'),
                             delimiter=',', quotechar='"')
except IOError:
    pass
user6832484 (edited by Peter Mortensen)
4

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte

The above error occurs because of the file's encoding.

Solution: use encoding='latin-1'.

Reference: https://pandas.pydata.org/docs/search.html?q=encoding

Kalluri
3

Replace the encoding with encoding='ISO-8859-1':

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)

2

Sometimes when calling open(filepath), where filepath actually is not a file, you can get the same error, so first make sure the file you're trying to open exists:

import os
assert os.path.isfile(filepath)
xtluo (edited by Peter Mortensen)
  • How would opening a file that doesn't exist generate a `UnicodeDecodeError`? And in Python it's customary to use [the EAFP principle](https://stackoverflow.com/q/11360858/5987) over the LBYL that you're endorsing here. – Mark Ransom Oct 17 '21 at 03:18
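Following the EAFP suggestion in the comment above, a hedged sketch (the ISO-8859-1 encoding is carried over from the other answers):

try:
    with open(filepath, encoding='ISO-8859-1') as f:
        for line in f:
            print(line)  # Process each line
except FileNotFoundError:
    print('No such file:', filepath)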
2

Open your file with Notepad++ and use the "Encoding" (or "Encodage") menu to identify the current encoding, or to convert it from ANSI to UTF-8 or the ISO 8859-1 code page.

JGaber (edited by Peter Mortensen)
1

So that this page turns up faster in Google searches for a similar question (an error with UTF-8), I am leaving my solution here for others.

I had a problem opening a .csv file, with this description:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte

I opened the file with Notepad and counted to the 150th position: it was a Cyrillic symbol. I re-saved that file with the 'Save as...' command, choosing Encoding 'UTF-8', and my program started to work.
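The same re-save can be done in Python (a hedged sketch; cp1251 is only a guess at the file's original Cyrillic encoding, so adjust it if detection says otherwise):

# Re-save the file as UTF-8, assuming the source encoding was cp1251
with open('data.csv', encoding='cp1251') as src:
    text = src.read()
with open('data-utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)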

Nikita Axenov (edited by Eric Aya)
  • Please note that questions and answers on SO must be in English only, even if the problem you encountered may bite mainly programmers using the Cyrillic alphabet. – Thierry Lathuille Aug 03 '21 at 10:01
  • @ThierryLathuille, is it a real problem? Could you please give me a link/reference to the community rule on that issue? – Nikita Axenov Aug 03 '21 at 10:13
  • This is considered a real problem, and is probably what caused your answer to get downvoted. Non-English content is not allowed on SO (see for example https://meta.stackoverflow.com/questions/297673/how-do-i-deal-with-non-english-content ), and the rule is really strictly respected. For questions in Russian, you have https://ru.stackoverflow.com/ , though ;) – Thierry Lathuille Aug 03 '21 at 10:20
  • @ThierryLathuille This applies to the English content, not problems with non-English symbols. And this doesn't necessarily have to be about other languages; it could be a different UTF-8 character (for example, a checkmark). – Anonymous Aug 03 '21 at 19:52
1

I keep coming across this error, and often the solution is not encoding='utf-8' but in fact engine='python', like this:

import pandas as pd

file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df

A link to the docs is here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

D.L
0

Use this if you are directly loading data from GitHub or Kaggle:

DF = pd.read_csv(file, encoding='ISO-8859-1')

0

In my case, this issue occurred because I had changed the extension of an Excel file (.xlsx) directly to .csv...

The solution was to open the file and save it as a new .csv file (i.e., File -> Save As -> select the .csv extension and save it). This worked for me.
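Alternatively (a hedged sketch, not from the original answer), read the Excel file directly with pandas and write a real CSV from it; pd.read_excel needs the openpyxl package for .xlsx files:

import pandas as pd

# Read the .xlsx file as-is instead of renaming it
df = pd.read_excel('data.xlsx')         # 'data.xlsx' is a placeholder name
df.to_csv('data.csv', index=False)      # Write a genuine CSV if one is needed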

afrah
0

My issue was similar in that UTF-8 text was getting passed to the Python script.

In my case, it was from SQL using the sp_execute_external_script in the Machine Learning service for SQL Server. For whatever reason, VARCHAR data appears to get passed as UTF-8, whereas NVARCHAR data gets passed as UTF-16.

Since there's no way to specify the default encoding in Python, and there was no user-editable Python statement parsing the data, I had to use the SQL CONVERT() function in my SELECT query in the @input_data_1 parameter.

So, while this query

EXEC sp_execute_external_script @language = N'Python', 
@script = N'
OutputDataSet = InputDataSet
', 
@input_data_1 = N'SELECT id, text FROM the_error;'
WITH RESULT SETS (([id] int, [text] nvarchar(max)));

gives the error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: unexpected end of data

Using CONVERT(type, data) instead (CAST(data AS type) would also work):

EXEC sp_execute_external_script @language = N'Python', 
@script = N'
OutputDataSet = InputDataSet
', 
@input_data_1 = N'SELECT id, CONVERT(NVARCHAR(max), text) FROM the_error;'
WITH RESULT SETS (([id] INT, [text] NVARCHAR(max)));

returns

id  text
1   Ç
Mark Smith