Tried using pd.read_csv() function but a UnicodeDecodeError shows up

Question

My dataset is a file called 'mnist.csv'. The code I used to read the file is:

data = pd.read_csv('mnist.csv', engine = 'python')

Upon executing the code, the following error shows up:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 10: invalid start byte

I referred to this post

Over there I used the code

import pandas as pd
data = pd.read_csv('mnist.csv', encoding= 'unicode_escape')

But it again shows an error

UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 8954-8955: truncated \UXXXXXXXX escape

Looks like the same question as https://stackoverflow.com/q/76773620/407651 — mzjn, Jul 31 '23 at 08:35
Yes that was my earlier post but I wanted to post another question with more clarity so that it would be more understandable — Aarush K, Jul 31 '23 at 08:54

S_Crespo · Answer 1 · 2023-07-31T10:27:07.420

I am summarizing my comments above here, in the hope that it helps future readers.

The problem is that your file path is a string that contains the backslash character (\), since you are on Windows.

(Note: this wasn't visible in your sample code, but my guess is that you simplified the path, since this solution worked.)

The backslash is also used to write escape sequences for special characters. For instance, "\n" stands for a line break.

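So if your path is, say, "C:\Users\Aarush\Documents\mnist_train.csv", Python tries to read \U as the start of a Unicode escape sequence, which fails with a "truncated \UXXXXXXXX escape" error. Other combinations, such as \n or \t, are silently replaced by a different character, so the path no longer points to your file.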

To avoid this and open your file properly, add an "r" before the path, like this:

mnist = pd.read_csv(r"C:\Users\Aarush\Documents\mnist_train.csv")

This "r" stands for "raw": it turns the string into a raw string, in which backslashes are taken literally and not as the start of an escape sequence.
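
To make the difference concrete, here is a minimal sketch using only the standard library. The folder name `new_folder` and the user name `me` are hypothetical and only serve the illustration.

```python
from pathlib import PureWindowsPath

# In a normal string literal, "\n" is ONE character (a line break).
normal = "C:\new_folder"   # hypothetical path; it now contains a newline
# In a raw string literal, "\n" stays as TWO characters: backslash + n.
raw = r"C:\new_folder"

print(len(normal), len(raw))  # the raw version is one character longer

# Two alternatives that denote the same path as the raw string:
escaped = "C:\\new_folder"  # doubled backslashes
forward = "C:/new_folder"   # forward slashes are accepted for Windows paths
assert raw == escaped
assert PureWindowsPath(raw) == PureWindowsPath(forward)

# A normal literal containing "\U" does not even compile. The source text
# is built from a raw string here so that this demo itself stays valid.
source = r'path = "C:\Users\me"'
try:
    compile(source, "<demo>", "exec")
    literal_failed = False
except SyntaxError as exc:
    literal_failed = True
    print("SyntaxError:", exc.msg)
```

Any of the three spellings (raw string, doubled backslashes, forward slashes) can be passed to pd.read_csv().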

If you still get an error, it comes from the parsing of the CSV file itself. In that case, also check the separator (comma, semicolon, ...; this is the "sep" parameter of pd.read_csv()) and the file encoding (the "encoding" parameter of pd.read_csv(), usually "utf-8").
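
Instead of guessing the separator by hand, one option is to let the standard library's csv.Sniffer look at the first few lines. This is only a sketch: the helper name `guess_separator`, the candidate list and the sample text are assumptions made for the illustration, and the sniffer can fail on unusual files.

```python
import csv


def guess_separator(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter from a text sample (e.g. the first lines of a file).

    `candidates` restricts the guess to characters that are plausible
    separators. csv.Sniffer raises csv.Error if it cannot decide.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Made-up sample: three columns separated by semicolons.
sample = "label;1x1;1x2\n5;0;0\n0;0;12\n"
separator = guess_separator(sample)
print(repr(separator))
```

The result could then be passed as `sep=separator` to pd.read_csv(). For a real file, read the sample with something like `open(path, encoding=...).read(4096)` first.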

There are no Windows paths in the question. How can that be the problem? — mzjn, Jul 31 '23 at 10:20
Thanks mzjn for your comment. Since the proposed maneuver worked, my guess is that the path was simplified by the author of the question. — S_Crespo, Jul 31 '23 at 10:22

tlentali · Accepted Answer · 2023-07-31T10:18:43.940

First, we can find the encoding used in the dataset by trying several values for the encoding parameter; here latin-1 seems to work better than utf-8. Then check the separator used (comma, semicolon, tab, ...):

import pandas as pd

data = pd.read_csv('mnist.csv', encoding='latin-1', sep=',')
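
Trying encodings one by one can be automated. The sketch below is a rough illustration, not a reliable detector: the helper name `first_working_encoding` and the candidate list are assumptions. Note that latin-1 maps every possible byte to a character, so it never raises an error; it "working" does not prove that the file really is latin-1 text.

```python
def first_working_encoding(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    Returns None if every candidate fails (which cannot happen while
    "latin-1" is in the list, since it accepts any byte).
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# Made-up sample: "é" stored as the single byte 0xE9, which is invalid UTF-8.
print(first_working_encoding(b"caf\xe9,1\n"))

# Byte 0x9d (the one named in the error message) is invalid as a UTF-8
# start byte and undefined in cp1252, so only latin-1 accepts it.
print(first_working_encoding(b"\x9d"))
```

For a real file, the bytes can be read with `open(path, "rb").read()`, and the result passed as `encoding=...` to pd.read_csv().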

Since the data you want is the classic MNIST dataset, we can also get it directly from keras like so:

from keras.datasets import mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

edited Jul 31 '23 at 10:18

answered Jul 31 '23 at 08:22

tlentali

Thanks for the reply but the same error shows up: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 10: invalid start byte – Aarush K Jul 31 '23 at 08:26
Hello Aarush K! Hope you are well. Maybe you can try the `latin-1` encoding argument. Does it work? – tlentali Jul 31 '23 at 08:30
I'm sorry but I don't know how to execute that :P – Aarush K Jul 31 '23 at 08:32
@AarushK `encoding="latin1"`... – AKX Jul 31 '23 at 08:32
Yes exactly, `data = pd.read_csv('mnist.csv', encoding='latin-1')` – tlentali Jul 31 '23 at 08:34
Im sorry to bother yall again but it shows an error again T_T ParserError: Error tokenizing data. C error: Expected 1 fields in line 15, saw 2 – Aarush K Jul 31 '23 at 08:37
Do you know if the separator is something other than a comma? Is it a semicolon? – tlentali Jul 31 '23 at 08:39
I updated my answer :) ! Can you try it and tell me if it is better :) – tlentali Jul 31 '23 at 08:41
To complete the previous comment: add the argument sep=';' in pd.read_csv() and see how it works. The separator could also be something else in some cases ('\t', '-', etc.), but comma and semicolon are the most frequent. Also, keep in mind that both the `encoding` and `sep` parameters have to be correct for the file to be read, so you may need to try several combinations before finding the proper one: "utf-8" / ";", or "latin-1" / ";", etc. Read the pd.read_csv() documentation to learn more: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html – S_Crespo Jul 31 '23 at 08:44
Thanks @S_Crespo ! Yes, you can try `data = pd.read_csv('mnist.csv', encoding='latin-1', sep=';')` – tlentali Jul 31 '23 at 08:46
The errors are never ending! Now it shows this: ParserError: Error tokenizing data. C error: Expected 2 fields in line 75, saw 3 – Aarush K Jul 31 '23 at 08:49
Could you post a link to the csv file? (the link where you downloaded it from) – S_Crespo Jul 31 '23 at 08:49
Yeah sure give me a minute – Aarush K Jul 31 '23 at 08:51
I got it from Kaggle: https://www.kaggle.com/datasets/oddrationale/mnist-in-csv – Aarush K Jul 31 '23 at 08:52
This dataset is a classic one :) ! You can have it directly in keras. I updated my answer :) – tlentali Jul 31 '23 at 08:57
I opened it with no specific argument. But there may be some pitfalls: what is your OS (Linux, Windows, Mac)? The default encoding is not the same on each (utf-8, windows, etc). Also, did you first unzip the content of the zip file before reading it? And lastly, the zip file contains 2 files, mnist_train and mnist_test, but none named just mnist, so are you opening the proper file? – S_Crespo Jul 31 '23 at 09:01
@tlentali Yeah its pretty common, its my first time doing a project on Python and I'm actually trying to do an Image Classification Project! – Aarush K Jul 31 '23 at 09:18
@S_Crespo its Windows – Aarush K Jul 31 '23 at 09:19
@tlentali Thanks for the update! I actually dont use the module keras since my instructor told me to use the Pandas package – Aarush K Jul 31 '23 at 09:22
@S_Crespo Hi! I tried unzipping the file like you said and it ended up becoming 2 files! One is named mnist_train and the other mnist_test – Aarush K Jul 31 '23 at 09:23
So you must do: `mnist_train = pd.read_csv("path/to/mnist_train.csv")` and the same for mnist_test. First try with the default parameters, and if that doesn't work, try changing one parameter (sep or encoding) at a time – S_Crespo Jul 31 '23 at 09:26
Side note: since you are on Windows, if there are backslashes in your file path (see the example below), please add the letter 'r' before the string that represents the path to your file: this ensures that the backslash character '\' is treated as part of the string and not as an escape character. Forgetting the "r" (which means "raw string") may cause an "encoding error" as well. For instance, if your file is in C:\Users\Aarush\Documents\mnist_train.csv, then you must type: `pd.read_csv(r"C:\Users\Aarush\Documents\mnist_train.csv")` – S_Crespo Jul 31 '23 at 09:29
@S_Crespo tlentali Thank you so much!!! ITS WORKING!! – Aarush K Jul 31 '23 at 09:41
So glad to read that @Aarush K and very happy to help :) ! Can you accept the answer please ? – tlentali Jul 31 '23 at 09:50
Hi @AarushK if this or any answer has solved your question please consider [accepting it](https://meta.stackexchange.com/q/5234/179419) by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. – tlentali Jul 31 '23 at 09:57
Oh sure, sorry I'm new to this website – Aarush K Jul 31 '23 at 10:02
I changed the answer several times as OP tried the different suggestions that @S_Crespo (whom I just upvoted) and I provided. I will add more to the answer to cover all our suggestions. – tlentali Jul 31 '23 at 10:15
I updated the answer to cover the pandas part and the keras plan b solution. Always happy to help. – tlentali Jul 31 '23 at 10:20
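
The comments above ask whether the downloaded archive was unzipped before reading it. A quick way to check is the standard library's zipfile module. This is a sketch under an assumption: nothing in the thread confirms that 'mnist.csv' was itself a ZIP archive, and the helper name `zip_members` and the tiny demo files are made up for the illustration.

```python
import os
import tempfile
import zipfile


def zip_members(path):
    """Return the member names if `path` is a ZIP archive, else an empty list."""
    if not zipfile.is_zipfile(path):
        return []
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical case: a ZIP archive saved under a .csv name.
    archive_path = os.path.join(tmp, "mnist.csv")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("mnist_train.csv", "label,1x1\n5,0\n")
        archive.writestr("mnist_test.csv", "label,1x1\n7,0\n")

    # A genuine plain-text CSV file for comparison.
    plain_path = os.path.join(tmp, "plain.csv")
    with open(plain_path, "w", encoding="utf-8") as handle:
        handle.write("label,1x1\n5,0\n")

    members = zip_members(archive_path)
    plain_members = zip_members(plain_path)

print(members)        # names of the files inside the archive
print(plain_members)  # empty list: not a ZIP archive
```

If the list is not empty, extract the archive first and pass the path of one extracted file to pd.read_csv().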
