How to solve python 'utf-8' error?

Question

I am trying to read a 6 GB file in my Python 3 terminal, but the line that reads the file fails. The code is below:

#define data directory

data_dir = 'C://Star/star_data/csv\Globe'

#read the review dataset
yelp = pd.read_csv(data_dir+'\star_data_python.csv')
X, y = star.data, star.target
X.shape

error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-bc09b45c73bb> in <module>()
      4 
      5 #read the review dataset
----> 6 yelp = pd.read_csv(data_dir+'\star_data_python.csv')
      7 X, y = star.data, star.target
      8 X.shape

What could be the problem? Thanks.

You are using both `/` and `\` in your path. If you are on Windows, please use only `/`, with `r` in front of your string, e.g. `data_dir = r'C://Star/star_data/csv/Globe'` — Kruupös, Jun 30 '17 at 11:40
If it is not a path issue, try adding `, encoding='utf-8'` to read_csv — Kruupös, Jun 30 '17 at 11:44
I have corrected it to the lines below, but I am still getting the same error: `data_dir = r'C://Star/star_data/csv/Globe'` and `yelp = pd.read_csv(data_dir + '/star_data_python.csv')` — , Jun 30 '17 at 11:47
Hi @MugB, try adding the utf-8 encoding when reading your csv file: `pd.read_csv(data_dir+'\star_data_python.csv', encoding='utf-8')` — Kruupös, Jun 30 '17 at 11:49
pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:6086)() pandas\parser.pyx in pandas.parser.TextReader._get_header (pandas\parser.c:9266)() UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 2543: invalid continuation byte — , Jun 30 '17 at 11:49
I am still getting the error. I wonder if it is due to the file size; it has about 30,000 dimensions — , Jun 30 '17 at 11:50
`pd.read_csv(data_dir+'\star_data_python.csv', encoding='utf-8', errors='ignore')`, or try to find the right encoding of your file :) This is **not** a size problem. — Kruupös, Jun 30 '17 at 11:51
It seems to be suggesting that I create some kind of parser. TypeError Traceback (most recent call last) in () 4 5 #read the review dataset ----> 6 yelp = pd.read_csv(data_dir+'\star_data_python.csv', encoding='utf-8', errors='ignore') TypeError: parser_f() got an unexpected keyword argument 'errors' — , Jun 30 '17 at 11:54
pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:10415)() pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)() pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:11437)() pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:11308)() pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:27037)() CParserError: Error tokenizing data. C error: out of memory — , Jun 30 '17 at 12:02
I'm sorry, read_csv doesn't have an `errors` argument. Please see the link in my answer to check all the optional arguments. — Kruupös, Jun 30 '17 at 12:10
It worked with ISO, but I still had memory issues trying to process the data — , Jul 01 '17 at 12:17
Ok, you may need to split your file into several chunks if possible. I ran into a similar issue a year ago and decided to create a file that keeps the indices of the chunks in a big file. The plus side is that you may be able to use multithreading. Good luck! — Kruupös, Jul 01 '17 at 15:47
Thanks for the idea, Max. As much as possible I didn't want to manipulate the raw file, as I noticed that when I attempt to do that, it gets saved incorrectly. Lots more to learn on how best to deal with huge files like this! :) — , Jul 01 '17 at 15:53
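
The chunking idea from the comments can be tried without touching the raw file. Below is a minimal sketch using only the standard library; the names `iter_chunks` and `count_rows` and the default of 100,000 rows per batch are made up for illustration, and `latin1` is assumed only because the asker reported that an ISO encoding worked. Only one batch of rows is held in memory at a time.

```python
import csv
from itertools import islice


def iter_chunks(path, chunk_rows=100_000, encoding="latin1"):
    """Yield (header, rows) batches of at most chunk_rows rows each.

    Hypothetical helper: the file is streamed, so it is never loaded whole.
    """
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # first line is assumed to be the header
        if header is None:
            return  # empty file, nothing to yield
        while True:
            rows = list(islice(reader, chunk_rows))
            if not rows:
                break
            yield header, rows


def count_rows(path, **kwargs):
    """Example of per-chunk processing: count data rows batch by batch."""
    return sum(len(rows) for _, rows in iter_chunks(path, **kwargs))
```

pandas offers something comparable through the `chunksize` argument of `read_csv`, which returns an iterator of DataFrames instead of one large DataFrame. Whether this is enough depends on whether the later processing can also work one piece at a time.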

score 1 · Accepted Answer · answered Jun 30 '17 at 12:05

Put r before your path string, since you are on Windows:

e.g

data_dir = r'C://Star/star_data/csv/Globe'

The 'r' means that the string is treated as a raw string, so backslash escape sequences such as \n or \t are not interpreted and each backslash is kept as a literal character.
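
To see what the r prefix changes, here is a small sketch. The path `C:\new\table.csv` is made up for illustration and was picked because `\n` and `\t` are real escape sequences. A path written with forward slashes only has no backslashes to escape, so the prefix makes no difference there.

```python
# Hypothetical path, chosen because \n and \t are valid escape sequences.
plain = 'C:\new\table.csv'    # \n becomes a newline, \t becomes a tab
raw = r'C:\new\table.csv'     # backslashes are kept literally
forward = 'C:/new/table.csv'  # no backslashes, so nothing to escape

print(repr(plain))  # shows the embedded newline and tab
print(repr(raw))    # shows the literal backslashes
print(len(plain), len(raw))  # the raw string is two characters longer
```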

Try calling read_csv with encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'; these are the encodings most commonly found on Windows (in Python, latin1 and iso-8859-1 are two names for the same codec).

e.g

full_path = data_dir + r'/star_data_python.csv'
pd.read_csv(full_path, encoding='latin1')
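
If you would rather check which encoding fits before calling read_csv, something like the sketch below may help. The function name `first_working_encoding` and the candidate order are my own choices, not part of pandas. It streams the file line by line, so memory is not a concern, although a full pass over 6 GB will take a while. Keep in mind that latin1 maps every possible byte to a character and therefore never raises; a "success" with latin1 only means the file could be decoded, not that the characters are the intended ones.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes the whole file, else None.

    Hypothetical helper; candidates are tried in the order given.
    """
    for encoding in candidates:
        try:
            with open(path, encoding=encoding) as handle:
                for _ in handle:  # read line by line; nothing is kept
                    pass
        except UnicodeDecodeError:
            continue  # this encoding failed somewhere, try the next one
        return encoding
    return None


# The byte from the traceback (0xe3) is not valid UTF-8 in this position,
# but it is a valid latin1 character.
sample = b"S\xe3o Paulo"
try:
    sample.decode("utf-8")
except UnicodeDecodeError:
    print("utf-8 failed")
print(sample.decode("latin1"))  # São Paulo
```

The returned name can then be passed on, e.g. pd.read_csv(full_path, encoding=first_working_encoding(full_path)).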
