57

I am trying to read in a dataset called df1, but it does not work

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")

df1.head()

The code above produces a long traceback; this is the most relevant part:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
Henry Ecker
Tuyen

5 Answers

97

The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:

b'Korea, Dem. People\x92s Rep.'

Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’:

df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
                  sep=";", encoding='cp1252')

Demo:

>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
...                   sep=";", encoding='cp1252')
>>> df1.head()
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..
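For readers who want to find the offending byte themselves, here is a minimal sketch using only the standard library. The `find_bad_byte` helper is a hypothetical name introduced for illustration, and `raw` holds just the one country name quoted above rather than the whole file:

```python
# The single field quoted in the answer; in practice this would be the file's bytes.
raw = b"Korea, Dem. People\x92s Rep."


def find_bad_byte(data: bytes, encoding: str = "utf-8"):
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not decode
        return exc.start, data[exc.start]
    return None


result = find_bad_byte(raw)
if result is not None:
    offset, value = result
    print(f"byte {value:#04x} at offset {offset}")

# The same bytes decode cleanly as Windows codepage 1252
print(raw.decode("cp1252"))
```

Once the byte value is known, it can be looked up in candidate code pages to see which one yields a plausible character.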

I note, however, that pandas seems to take the HTTP headers at face value too and produces mojibake when you load the data from a URL. When I save the data to disk and then load it with pd.read_csv(), the data is decoded correctly, but loading from the URL produces re-coded data:

>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'

This is a known bug in pandas. You can work around it by loading the URL with urllib.request and passing the response to pd.read_csv() instead:

>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
...     df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
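The re-coding effect can be reproduced without pandas or a network connection. This is only a sketch of the mechanism, assuming the text was decoded correctly once, re-encoded as UTF-8, and then decoded a second time as cp1252; the variable names are illustrative:

```python
original = "Korea, Dem. People\u2019s Rep."  # \u2019 is the right single quotation mark

# Correct text, re-encoded as UTF-8, then wrongly decoded as cp1252:
# the three UTF-8 bytes of the quote become three separate characters.
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)

# Reversing the two steps recovers the original text.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired)
```

The reverse step is the same `.encode('cp1252').decode('utf8')` round trip shown in the session above.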
Martijn Pieters
  • Hi Martijn, how could you know its encoding is cp1252? – Tuyen Sep 01 '17 at 12:31
  • @Tuyen: experience. – Martijn Pieters Sep 01 '17 at 12:32
  • You can also say `encoding='latin1'`. Works for me. – RajeshM Mar 10 '21 at 17:16
  • @RajeshM that could be because latin1 “works” on *any* file. **That doesn’t mean that the decoded text will be readable or without issues.** Windows-1252 and Latin-1 are closely related but *not the same*. If you get weird characters in your result you picked the wrong codec. – Martijn Pieters Mar 11 '21 at 08:34
  • This codec will avoid the error (like ISO 8859-1) but both have an issue with "don't" and similar-> turns into a root symbol. Source: a csv created from US based Excel 2010. Also, should it be mentioned that cp1252 is listed as "Western European"? – DISC-O May 23 '22 at 18:49
  • @DISC-O: Not sure what you mean by "root symbol"; the [SQUARE ROOT `√` symbol](https://www.fileformat.info/info/unicode/char/221a/index.htm) [is not part of cp1252](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) so presumably you have something else? Do you know the (hex or decimal) value of the specific byte? If the original dataset used a Windows 125x series codepage, it should be hex 92, decimal 146, for a right single quotation mark, `’`. – Martijn Pieters May 27 '22 at 12:03
  • @DISC-O: "US based Excel 2010" doesn't tell me anything, unfortunately; the [default should be 1252, apparently](https://stackoverflow.com/questions/508558/what-charset-does-microsoft-excel-use-when-saving-files). "Western Europe" is *one* of the names that Microsoft uses for the codec; the misnomer "ANSI Latin 1" is also used, historically. I'm not sure what mentioning it will add to the answer? – Martijn Pieters May 27 '22 at 12:20
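The difference between the two codecs discussed in these comments can be checked directly. A small sketch, with a hypothetical `decodes` helper, showing that Latin-1 accepts every byte while Python's cp1252 codec rejects the few byte values it leaves undefined, and that the two disagree on 0x92:

```python
def decodes(data: bytes, encoding: str) -> bool:
    """Report whether data can be decoded with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


all_bytes = bytes(range(256))
print(decodes(all_bytes, "latin-1"))  # Latin-1 maps every byte value
print(decodes(all_bytes, "cp1252"))   # cp1252 leaves a few values (e.g. 0x81) undefined

# The same byte yields different characters in the two codecs
as_cp1252 = b"\x92".decode("cp1252")   # U+2019 RIGHT SINGLE QUOTATION MARK
as_latin1 = b"\x92".decode("latin-1")  # U+0092, an invisible C1 control character
print(repr(as_cp1252), repr(as_latin1))
```

So a file that loads without error under `latin1` may still contain invisible control characters where quotes were intended.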
7

In my case the CSV was created on macOS and parsed on a Windows machine, which raised the UnicodeDecodeError. To get rid of this error, try passing encoding='mac_roman' to pandas' read_csv method.

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()

Output:

                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  Unnamed: 15  2014  2015
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..          NaN    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..          NaN    ..    ..
navule
  • Like Latin-1 / ISO-8859-1, the [Mac OS Roman character set](https://en.wikipedia.org/wiki/Mac_OS_Roman) maps all 256 possible byte values to a single character so will **never** result in a decode error and will work on any file. That doesn't mean that you used the correct codec, you may still get weird characters in your dataset as a result. – Martijn Pieters May 27 '22 at 12:09
  • I did this and it changes the format of word. e.g. Can't got converted to canít. But encoding='cp1252' worked for me – user3665906 Mar 27 '23 at 05:31
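The effect described in the last comment is easy to reproduce. A short sketch, assuming the source bytes were written as cp1252 (the sample word is the one from the comment):

```python
raw = b"Can\x92t"  # "Can't" with a curly apostrophe, as cp1252 bytes

# Byte 0x92 means different characters in the two code pages
for codec in ("mac_roman", "cp1252"):
    print(codec, repr(raw.decode(codec)))

# mac_roman never raises: every one of the 256 byte values maps to a character
print(len(bytes(range(256)).decode("mac_roman")))
```

The absence of an exception therefore says nothing about whether `mac_roman` was the right choice; only the decoded text does.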
3

Use 'ISO-8859-1' instead of "utf-8" for decoding

with open(fn, 'rb') as f:  # fn is the path to your file
    text = f.read().decode('ISO-8859-1')

See: https://grabthiscode.com/whatever/utf-8-codec-cant-decode-byte-0x85-in-position-715-invalid-start-byte

Mounesh
  • Much better. Consider summarizing the reason why this helps and maybe how it works. Also check whether your post provides anything new, e.g. in comparison to the comment by Martijn Pieters on https://stackoverflow.com/a/54886413/7733418 Or consider referring to that for credits. – Yunnosch Sep 24 '22 at 09:28
0

This problem occurs because the file contains bytes that are not valid in the expected encoding. For example, a file that is otherwise UTF-8 may contain a few characters encoded as Windows-1250. Remove or replace these characters to solve the problem.
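If removing or replacing the stray bytes is acceptable, Python's decoders can do it through the `errors` argument. A minimal sketch on the single field quoted earlier in this thread; note that both options lose the original character, so decoding with the correct codec is usually preferable:

```python
data = b"Korea, Dem. People\x92s Rep."

# Substitute U+FFFD (the replacement character) for each undecodable byte
replaced = data.decode("utf-8", errors="replace")
print(replaced)

# Or silently drop undecodable bytes
ignored = data.decode("utf-8", errors="ignore")
print(ignored)
```

Recent pandas versions expose a similar option on `read_csv` as `encoding_errors`; check the documentation for your installed version before relying on it.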

AM80
-2

This works

df = pd.read_csv(inputfile, engine = 'python')

  • Please read "[answer]" and "[Explaining entirely code-based answers](https://meta.stackoverflow.com/q/392712/128421)". It helps more if you supply an explanation why this is the preferred solution and explain how it works. We want to educate, not just provide code. – the Tin Man Mar 20 '22 at 23:07