Keep Getting a UnicodeDecodeError When Trying to Read CSV with Pandas

Question

I am trying to read a csv in python, and keep getting the below error. I tried other csv files that I worked with previously without issue on my other computer, and I get the same error message with those as well. I recently switched computers, but what is also bizarre is that yesterday I read a different csv saved in the same network location without any problems. I have no idea what is causing this but would like to be able to load my previous files if anyone has any ideas.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Input In [17], in <module>
      1 import pandas as pd
----> 3 df = pd.read_csv(r"C:\Users\nabecker\OneDrive - McDermott Will & Emery LLP\Documents\Parent Data for Analysis.csv")

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py:586, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    571 kwds_defaults = _refine_defaults_read(
    572     dialect,
    573     delimiter,
   (...)
    582     defaults={"delimiter": ","},
    583 )
    584 kwds.update(kwds_defaults)
--> 586 return _read(filepath_or_buffer, kwds)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py:482, in _read(filepath_or_buffer, kwds)
    479 _validate_names(kwds.get("names", None))
    481 # Create the parser.
--> 482 parser = TextFileReader(filepath_or_buffer, **kwds)
    484 if chunksize or iterator:
    485     return parser

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py:811, in TextFileReader.__init__(self, f, engine, **kwds)
    808 if "has_index_names" in kwds:
    809     self.options["has_index_names"] = kwds["has_index_names"]
--> 811 self._engine = self._make_engine(self.engine)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py:1040, in TextFileReader._make_engine(self, engine)
   1036     raise ValueError(
   1037         f"Unknown engine: {engine} (valid options are {mapping.keys()})"
   1038     )
   1039 # error: Too many arguments for "ParserBase"
-> 1040 return mapping[engine](self.f, **self.options)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:69, in CParserWrapper.__init__(self, src, **kwds)
     67 kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
     68 try:
---> 69     self._reader = parsers.TextReader(self.handles.handle, **kwds)
     70 except Exception:
     71     self.handles.close()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\_libs\parsers.pyx:542, in pandas._libs.parsers.TextReader.__cinit__()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\_libs\parsers.pyx:642, in pandas._libs.parsers.TextReader._get_header()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\_libs\parsers.pyx:843, in pandas._libs.parsers.TextReader._tokenize_rows()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\_libs\parsers.pyx:1917, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 95538: invalid continuation byte
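
The last line of the traceback names both the offending byte (0xe4) and its offset (95538), so the bytes around that position can be inspected directly. In Latin-1 and Windows-1252, 0xe4 is the letter "ä". The following is a minimal sketch using only the standard library; the function name `find_invalid_utf8` and the sample data are illustrative assumptions, not part of the original post.

```python
def find_invalid_utf8(data: bytes, context: int = 20):
    """Return (offset, bad_byte, surrounding_bytes) for the first
    invalid UTF-8 sequence in data, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        bad_byte = data[exc.start:exc.start + 1]
        return exc.start, bad_byte, data[start:exc.end + context]
    return None


# Hypothetical sample: a CSV fragment saved as Windows-1252, containing "ä".
sample = "id,name\n1,J\u00e4ger\n".encode("cp1252")
result = find_invalid_utf8(sample)
if result is not None:
    offset, bad_byte, nearby = result
    print(f"offset={offset} byte={bad_byte!r} nearby={nearby!r}")
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function; the surrounding bytes usually make it obvious which cell holds the non-ASCII character.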

Do you know the encoding of the file you're trying to load? Have a look at this question to check invalid bytes in your file: https://stackoverflow.com/questions/29465612/how-to-detect-invalid-utf8-unicode-binary-in-a-text-file — Jan Wilamowski, Jan 26 '22 at 02:17
No, those other posts don't answer my question. I know workarounds, but the point is that I had no issue yesterday with a file that is stored in the exact same location as the one that is giving me the error now. In addition, older csv files that I saved and worked with previously no longer work. I just can't tell if this is a computer or Windows issue, a Python or Pandas version issue or what. — nb1214, Jan 26 '22 at 02:27
You haven't answered my question: do you know the encoding of the file? If so, pass it to `pd.read_csv()`. Otherwise it will assume UTF-8, which may not be correct. — Jan Wilamowski, Jan 26 '22 at 02:35
No, I do not know the encoding. They are standard Excel files with numbers and strings, no special characters or anything. One works and one does not, and they are essentially different versions of the same file. Plus, the 10 other csv files that I worked with a week ago no longer read either. They all worked without a hiccup before and they are all stored in the same folder. I realize I can save them as CSV UTF-8 and they will work, but I am really curious why I never had to do that previously with the exact same files. — nb1214, Jan 26 '22 at 02:42
I passed it into pd.read_csv(r"filepath") originally, but that is when I get the error message that I posted. — nb1214, Jan 26 '22 at 02:43
Something may have changed on your system. Are you sure there are no special characters? Pretty much everything outside the English alphabet is "special". You could check which encoding Excel used to read the file. You could also try providing some common encodings to Pandas and see if it works by trial and error. — Jan Wilamowski, Jan 26 '22 at 02:49
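
The trial-and-error approach suggested in the last comment can be sketched as follows. This is a minimal illustration that uses only the standard library; the function name `first_working_encoding` and the candidate list are assumptions chosen for the example, and a successful decode only shows that the bytes are valid in that encoding, not that the encoding is the right one.

```python
def first_working_encoding(data: bytes,
                           candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes data without
    raising, or None if every candidate fails."""
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


# Hypothetical samples: the same text saved in two different encodings.
as_cp1252 = "a,b\n1,\u00e4\n".encode("cp1252")
as_utf8 = "a,b\n1,\u00e4\n".encode("utf-8")

print(first_working_encoding(as_cp1252))
print(first_working_encoding(as_utf8))
```

The returned name can then be passed on as `pd.read_csv(path, encoding=...)`. Note that "latin-1" accepts every possible byte, so it belongs last in the list and the loaded text should still be checked by eye.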

score 0 · Answer 1 · answered Jan 26 '22 at 03:27

It seems that you stored your files on OneDrive.

Sometimes a network drive changes a file's encoding. For example, whenever I save a file in Dropbox on Windows I run into this kind of issue: something gets changed, so I have to be careful when using the file on a Mac.

There are several ways to deal with this kind of encoding issue:

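# Way 1. Use the "ISO-8859-1" (or "latin-1") encoding when you open the file
filename = 'data.csv'  # placeholder file name
with open('../Resources/' + filename, 'r', encoding="ISO-8859-1") as f:
    text = f.read()

# Way 2. Ignore decoding errors when you open the file (undecodable bytes are dropped)
with open('u.item', encoding='utf8', errors='ignore') as f:
    text = f.read()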

Please check that the file opened correctly and that all characters are readable once it loads without an exception.
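
To see why that check matters, the two approaches can be compared on a small byte string. This is only a sketch with made-up sample data; it also shows a third option, `errors="replace"`, which is not mentioned in the answer but is a standard Python error handler.

```python
# Hypothetical sample: "Jäger" stored as Latin-1 bytes (b'name\nJ\xe4ger\n').
raw = "name\nJ\u00e4ger\n".encode("latin-1")

# Way 1: ISO-8859-1 maps every byte to a character, so it never raises.
way1 = raw.decode("ISO-8859-1")

# Way 2: UTF-8 with errors="ignore" silently drops undecodable bytes.
way2 = raw.decode("utf-8", errors="ignore")

# Alternative: errors="replace" marks undecodable bytes with U+FFFD.
way3 = raw.decode("utf-8", errors="replace")

print(repr(way1))
print(repr(way2))
print(repr(way3))
```

With Way 2 the "ä" disappears from the text without any warning. The `read_csv` signature shown in the question's traceback exposes matching `encoding` and `encoding_errors` parameters, so the same choices can be made directly in Pandas.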

Park

Yes my files are on OneDrive, but so is the one that works without issue, including in the same folder containing many of the same fields. Why would this one file read fine and the other not? I am wondering if there is just a strange issue on the backend and if I need to reinstall python. – nb1214 Jan 27 '22 at 00:02
Also, all of the files read fine on my other computer. Therefore, it is something with my python installation on this computer, the new computer itself or something. My new computer has no problem reading the one file, but won't read any of the others. My old computer will read any csv without errors. – nb1214 Jan 27 '22 at 00:10
@nb1214 I understand. It is very complicated. In my experience with text analysis, this usually happens when a file includes at least one special character that needs a different encoding. Sometimes it won't be an issue because the special characters can still be loaded with an inappropriate encoding, but sometimes it will be a problem when the special characters must be loaded with a specific encoding. It is very hard to find if the text is long and there are many files. – Park Jan 27 '22 at 00:23
I get what you are saying, but every file I have tried except 1 does not work (I have tried at least 10). Yet, they all work without issue on my other computer, so it is not the files themselves that are different, as I have been working with them without issue for a long time. What would the difference be between the two machines, and why can I read 1 file without issue and not any of the others? None of the files have special characters either; they are all numbers without commas, dollar signs or any other characters. – nb1214 Jan 27 '22 at 00:31
The version of Python on this computer is different, and maybe I need a clean install or to go back to an older version of Python. Just so strange the one file reads without a problem and the others don't, and they contain the same data and are stored in the same folder. – nb1214 Jan 27 '22 at 00:32
@nb1214 Aha, I hope you can solve the problem by installing same version of Python soon! – Park Jan 27 '22 at 00:34
@nb1214 If this helped solve your problem, pls mark this answer as accepted, so that this can help other users know this question is solved by this answer. – Park Feb 07 '22 at 14:05
Yes, I reinstalled and everything works fine. Must have been something weird. – nb1214 Feb 09 '22 at 00:43
