18

I scraped the data and had to save the dataframe as UTF-16 (Unicode) because the Latin/Spanish words were displayed incorrectly in UTF-8. I used the following code to save the dataframe:

 df.to_csv("blogdata.csv", encoding = "utf-16", sep = "\t", index = False)

When I try to read the file to clean the data using the following code:

 blogdata = pd.read_csv('c:/Users/hyoungm?Downloads/blogdata.csv')

it shows the following error.


UnicodeDecodeError Traceback (most recent call last) in () ----> 1 blogdata = pd.read_csv('C:/Users/hyoungm/Downloads/blogdata.csv')

...

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._get_header()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


I don't know how to save the original data without losing the Latin/Spanish words within English sentences, or how to read a Unicode data file. Can anybody please help me solve this issue?

Thank you very much!

Hyoungeun Moon

5 Answers

17

I was facing the same issue. You can try:

blogdata = pd.read_csv('c:/Users/hyoungm/Downloads/blogdata.csv', sep="\t", encoding='latin')
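One caveat, sketched here with the standard library only: Latin-1 maps every possible byte to a character, so decoding with it never raises an error, but on UTF-16 data it can silently produce wrong text. The sample string below is a made-up stand-in for the real file contents.

```python
import codecs

# Hypothetical sample: one Spanish letter stored as UTF-16 (little-endian, with BOM).
raw = codecs.BOM_UTF16_LE + "ñ".encode("utf-16-le")

# Latin-1 accepts any byte, so this does not fail, but the BOM and the
# zero padding bytes end up in the text as stray characters.
as_latin = raw.decode("latin-1")

# Decoding with the codec that was used for writing recovers the original.
as_utf16 = raw.decode("utf-16")

print(repr(as_latin))
print(repr(as_utf16))
```

If the read succeeds but the first column name starts with odd characters such as `ÿþ`, that is a hint the file is really UTF-16.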
Kashifa
16

There is a Python library which may help when the encoding is unknown: chardet

import chardet

with open(filename, 'rb') as file:
    print(chardet.detect(file.read()))

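chardet.detect guesses the encoding from the raw bytes and returns a dict that includes the guessed 'encoding' and a 'confidence' score. 'rb' opens the file in binary mode, so the bytes are read without being decoded first. chardet is a third-party package (pip install chardet).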
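If installing a detection library is not an option, a cheaper check is to look for a byte-order mark (BOM) at the start of the file. This is a different technique from chardet's statistical guess, and it only helps for files that actually begin with a BOM. The helper name `sniff_bom` and the sample text are made up for this sketch.

```python
import codecs

# Hypothetical helper table: BOM bytes mapped to a codec name Python understands.
# The UTF-32 BOMs come first because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw: bytes):
    """Return a codec name if raw starts with a known BOM, otherwise None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


# Python's "utf-16" codec writes a BOM when encoding.
sample = "niño\tEspaña\n".encode("utf-16")
print(sniff_bom(sample))
print(sniff_bom(b"plain ascii"))
```

For a real file, read the first few bytes with `open(filename, 'rb')` and pass them to the helper; a `None` result means no BOM was found, not that the file is UTF-8.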

Helen Batson
  • It's not so useful programmatically, but if you just need one-off encoding detection and have Notepad++ installed, it will give you that info (note: I have found that 'UCS-2 LE BOM' can be read using `encoding='utf-16'`) – James Aug 20 '20 at 16:16
  • This worked fine in a test I did, although I got a 'confidence' of 0.73. Not something to rely on, is it? Also, this is quite an "abnormal" method for just reading text files ... And consider that none of this was needed in Python 2.7, which had no problem at all with such unicode stuff! – Apostolos May 29 '21 at 07:17
  • wow, that was brilliant! it worked for me – capivarao Oct 18 '22 at 13:12
4

It seems that you're trying to decode your utf-16 encoded file with the utf-8 codec.
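The byte 0xff at position 0 in the error message fits this explanation: a UTF-16 file written by Python normally starts with a byte-order mark, and on little-endian machines its first byte is 0xff, which is never valid in UTF-8. A small standard-library sketch, with a made-up sample string, reproduces the message:

```python
import codecs

# Hypothetical sample text, encoded as UTF-16 with an explicit little-endian
# BOM so the bytes are the same on every platform.
raw = codecs.BOM_UTF16_LE + "España\tniño\n".encode("utf-16-le")

# Decoding as UTF-8 fails on the very first byte (the BOM).
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)

# Decoding with the matching codec works and strips the BOM.
text = raw.decode("utf-16")
print(repr(text))
```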

According to pandas documentation, you can specify the codec by passing the encoding argument to the read_csv() function.

Could you try the following code?

blogdata = pd.read_csv('c:/Users/hyoungm?Downloads/blogdata.csv', encoding = 'utf-16')

Hope this helps. And let me know if something is unclear.

EDIT: I guess the right file path should be 'c:/Users/hyoungm/Downloads/blogdata.csv' with a '/' between 'hyoungm' and 'Downloads', so adapt the script accordingly if I'm right.
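For reference, here is a minimal round trip, assuming a small made-up dataframe in place of the real blog data and a temporary directory in place of the Downloads folder. Because the file was written with `sep="\t"`, the same separator has to be passed when reading; with the default comma separator, each row would typically land in a single column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the scraped blog table.
df = pd.DataFrame({"blogger": ["José", "Núria"], "country": ["España", "México"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blogdata.csv")

    # Write as in the question: UTF-16, tab-separated, no index column.
    df.to_csv(path, encoding="utf-16", sep="\t", index=False)

    # Read back with the same encoding and the same separator.
    blogdata = pd.read_csv(path, encoding="utf-16", sep="\t")

print(blogdata)
```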

Séraphin
  • Thank you! It reads. But another issue arises. The data is combined in the first column, which has to be divided into 11 columns since each column holds a different variable (e.g., blogger, country, joined date, the number of followers, posting, etc.). Could you please help me with formatting the data in a table? I tried df.to_csv("blogdata.csv", encoding = "utf-16", "r"); df.to_csv("blogdata.csv", encoding = "utf-16", "rb"); df.to_csv("blogdata.csv", encoding = "utf-16", sep = ","); and df.to_csv("blogdata.csv", encoding = "utf-16", sep = "\t", index = False) – Hyoungeun Moon Apr 07 '19 at 21:46
  • I tried to read the file anyway, but it still shows the same error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte. – Hyoungeun Moon Apr 07 '19 at 21:53
  • Can you share a sample of your file with sharable data? It would be useful to identify the right format. – Séraphin Apr 07 '19 at 22:04
  • Please see this link: github.com/GemmyMoon/MultipleUrls.git for the csv file. This is saved as a csv file with utf-16. Would you mind trying with this file? Thank you for your edits in the answer. That was my typo and should be the slash. I tried the corrected code, but the result was the same. – Hyoungeun Moon Apr 09 '19 at 01:20
  • I downloaded the file and just read it by running this command : `blogdata = pd.read_csv('/home/seraphin/urls.csv')` In your question, you're mentioning having troubles with a file that you encoded by yourself. Is it this file that I got on the link you gave me in the previous comment? **Note:** The file only contains urls, which can only contain ASCII characters. So, no need for utf-16 encoding. – Séraphin Apr 09 '19 at 20:31
  • Oh my, my apologies for the mistake. Please see this new address: https://github.com/GemmyMoon/blogdata. I uploaded the csv file that I have now. Please let me know if you cannot access this one. Thank you! – Hyoungeun Moon Apr 10 '19 at 00:42
  • This file does not work. Actually, I'm not sure it has been encoded with the script you mentioned in your question. Could you please start from the beginning and let me know the result of this script? ``` import pandas as pd data = df = pd.DataFrame(data) df.to_csv('blogdatatest.csv', encoding='utf-16',sep = "\t" ) blogdata = pd.read_csv('blogdatatest.csv', encoding='utf-16',sep = "\t") ``` – Séraphin Apr 10 '19 at 01:04
  • Thank you for your trial. If you don't mind, would you check out the file that I just added (named blogdatatest_encoding_utf-16.csv) on the same website? – Hyoungeun Moon Apr 10 '19 at 01:41
  • I would prefer if you could first test the script in my previous comment. I need to be sure to start from the same base as you do. – Séraphin Apr 10 '19 at 01:54
  • I just added the blogdatatest file following your codes above. Could you please check this file? Thank you. – Hyoungeun Moon Apr 10 '19 at 02:08
  • What was the result when you ran the code I shared above? – Séraphin Apr 10 '19 at 03:27
4

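Try pd.read_csv(file_name, encoding_errors='replace'). This replaces every byte sequence that cannot be decoded with the Unicode replacement character (U+FFFD) instead of raising an error, so the affected characters are lost. The encoding_errors parameter requires pandas 1.3 or later.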
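To see what 'replace' does to the data, here is a sketch using plain `bytes.decode`, on the assumption that pandas' `encoding_errors` uses the same error-handler names as Python's codecs. The sample bytes are made up: the word "niño" stored as Latin-1 and then read as UTF-8.

```python
# Hypothetical sample: "niño" encoded as Latin-1, where ñ is the single byte 0xf1.
raw = "niño".encode("latin-1")

# Strict decoding as UTF-8 would raise; 'replace' swaps the bad byte for U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the codec that actually matches the bytes keeps the letter.
correct = raw.decode("latin-1")

print(repr(replaced))
print(repr(correct))
```

The read no longer fails, but the original character cannot be recovered from the result, so this is best kept as a last resort when the real encoding cannot be determined.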

0

This might not be the case for the OP, but for completeness: this error can also be triggered by reading binary files without b mode. See https://docs.python.org/3/library/functions.html#open

If you are trying to read serialized data, such as files saved by pickle or torch, you need open("filename", "rb") instead of open("filename").
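A short sketch of the difference, using pickle and a temporary file. The sample dict and file name are made up, and the text-mode read passes encoding="utf-8" explicitly so the behavior does not depend on the system's default encoding.

```python
import os
import pickle
import tempfile

# Hypothetical sample object.
data = {"blogger": "José", "followers": 120}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")
    with open(path, "wb") as f:
        pickle.dump(data, f)

    # Text mode tries to decode the pickle bytes as UTF-8 and fails.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode hands the raw bytes to pickle unchanged.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(text_mode_error)
print(loaded)
```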

Qin Heyang