I am trying to read a 6 GB file in my Python 3 terminal, but the line that reads the file fails to execute. The code is below:

#define data directory

data_dir = 'C://Star/star_data/csv\Globe'

#read the review dataset
yelp = pd.read_csv(data_dir+'\star_data_python.csv')
X, y = star.data, star.target
X.shape

error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-bc09b45c73bb> in <module>()
      4 
      5 #read the review dataset
----> 6 yelp = pd.read_csv(data_dir+'\star_data_python.csv')
      7 X, y = star.data, star.target
      8 X.shape

What could be the problem? Thanks.

  • You are using both `/` and `\` in your path. If you are on Windows, use only `/`, with `r` in front of the string, e.g. `data_dir = r'C://Star/star_data/csv/Globe'` – Kruupös Jun 30 '17 at 11:40
  • If it is not a path issue, try adding `encoding='utf-8'` to `read_csv` – Kruupös Jun 30 '17 at 11:44
  • I have corrected it to the lines below, but I am still getting the same error: `data_dir = r'C://Star/star_data/csv/Globe'` followed by `yelp = pd.read_csv(data_dir + '/star_data_python.csv')` –  Jun 30 '17 at 11:47
  • Hi @MugB, try adding the utf-8 encoding when reading your csv file: `pd.read_csv(data_dir + '/star_data_python.csv', encoding='utf-8')` – Kruupös Jun 30 '17 at 11:49
  • pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:6086)() pandas\parser.pyx in pandas.parser.TextReader._get_header (pandas\parser.c:9266)() UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 2543: invalid continuation byte –  Jun 30 '17 at 11:49
  • I am still getting the error. I wonder if it is due to the file size; it has about 30,000 dimensions –  Jun 30 '17 at 11:50
  • `pd.read_csv(data_dir + '/star_data_python.csv', encoding='utf-8', errors='ignore')`, or try to find the right encoding of your file :) This is **not** a size problem. – Kruupös Jun 30 '17 at 11:51
  • It seems to be suggesting that I create some kind of parser. TypeError Traceback (most recent call last) in () 4 5 #read the review dataset ----> 6 yelp = pd.read_csv(data_dir+'\star_data_python.csv', encoding='utf-8', errors='ignore') TypeError: parser_f() got an unexpected keyword argument 'errors' –  Jun 30 '17 at 11:54
  • pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:10415)() pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)() pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:11437)() pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:11308)() pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:27037)() CParserError: Error tokenizing data. C error: out of memory –  Jun 30 '17 at 12:02
  • I'm sorry, read_csv doesn't have an `errors` argument. Please see the link in my answer to check all of its optional arguments. – Kruupös Jun 30 '17 at 12:10
  • thanks Max, i will try. –  Jun 30 '17 at 12:14
  • good job :) which encoding did you use at the end? – Kruupös Jun 30 '17 at 12:19
  • it worked with ISO but still had memory issues trying to process the data –  Jul 01 '17 at 12:17
  • OK, you may need to cut your file into several chunks if possible. I ran into a similar issue a year ago and decided to create a file that keeps the indices of the chunks within the big file. The plus side is that you may be able to use multithreading. Good luck! – Kruupös Jul 01 '17 at 15:47
  • Thanks for the idea, Max. As much as possible I didn't want to manipulate the raw file, as I noticed that when I attempt to do that, it gets saved incorrectly. Lots more to learn on how best to deal with huge files like this! :) –  Jul 01 '17 at 15:53
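
The chunking idea from the comments does not require splitting the raw file on disk. As a rough sketch, assuming a pandas version whose read_csv accepts chunksize (it then yields a sequence of smaller DataFrames instead of one large one), the file can be processed piece by piece. The in-memory sample, the column names city and stars, and the chunk size of 2 are invented for illustration; the real file would be passed by path with a much larger chunk size.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV: three rows, encoded as latin1.
sample = io.BytesIO("city,stars\nSão Paulo,4\nLyon,5\nSão Paulo,3\n".encode("latin1"))

total_rows = 0
star_sum = 0
# Each iteration loads only `chunksize` rows, so memory use stays bounded.
for chunk in pd.read_csv(sample, encoding="latin1", chunksize=2):
    total_rows += len(chunk)
    star_sum += int(chunk["stars"].sum())

print(total_rows, star_sum)  # 3 12
```

This only helps if the work can be done per chunk (counting, filtering, aggregating). Anything that needs all 30,000 columns of every row in memory at once will still hit the same limit.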

1 Answer

Put an r in front of your path string since you are on Windows, e.g.:

data_dir = r'C://Star/star_data/csv/Globe'

The 'r' prefix marks the string as a raw string, so backslashes are kept as literal characters instead of starting escape sequences.
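
To make the difference concrete, here is a small sketch with a made-up path (not the one from the question) that contains `\t` and `\n`, two sequences Python does treat as escapes:

```python
# Hypothetical Windows-style path, chosen only because it contains real escapes.
plain = 'C:\temp\new_data'   # \t becomes a tab, \n becomes a newline
raw = r'C:\temp\new_data'    # raw string: both backslashes survive

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))  # 14 16
```

In the path from the question, `\G` and `\s` are not recognized escape sequences, so Python leaves those backslashes in place. That fits the later comments, where fixing the path alone did not make the error go away.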

Try calling read_csv with encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'; these are encodings commonly found on Windows (latin1 and iso-8859-1 are two names for the same encoding).

e.g.

full_path = data_dir + r'/star_data_python.csv'
pd.read_csv(full_path, encoding='latin1')
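
If guessing by hand gets tedious, one possible approach is to test candidate encodings against the raw bytes before handing the file to pandas. The sketch below uses only the standard library; the helper name first_working_encoding, the candidate order, and the tiny temporary file are assumptions made for illustration. It reads in blocks, so a 6 GB file is never loaded whole.

```python
import codecs
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin1"),
                           block_size=1 << 20):
    """Return the first candidate that decodes the whole file, else None."""
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            with open(path, "rb") as handle:
                # Fixed-size blocks keep memory use flat regardless of file size.
                for block in iter(lambda: handle.read(block_size), b""):
                    decoder.decode(block)
                decoder.decode(b"", final=True)  # reject a truncated trailing sequence
        except UnicodeDecodeError:
            continue
        return name
    return None


# Tiny stand-in file containing byte 0xe3, the byte named in the traceback.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"city,stars\nS\xe3o Paulo,4\n")

detected = first_working_encoding(tmp.name)
os.remove(tmp.name)
print(detected)  # cp1252
```

The result can then be passed on, e.g. pd.read_csv(full_path, encoding=detected). Note that latin1 maps every possible byte to a character, so it never fails; a match there means the file is readable, not that every character is necessarily right. Newer pandas releases (1.3 and later) also accept an encoding_errors argument in read_csv, which is close to what the errors='ignore' attempt in the comments was after.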

List of helpful SO answers:

Kruupös