
I want to load the values from the "category" column into a pandas DataFrame. This is my TSV file:

Tagname   text  category
j245qzx_8   hamburger toppings   f
h833uio_7   side of fries   f
d423jin_2   milkshake combo   d

This is my code:

with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
    categoryColumn = df["category"]

    categoryList = []
    for line in categoryColumn:
        categoryColumn.append(line)

However, I get a UnicodeDecodeError for the line df = pd.read_csv(f, sep='\t'), and my code stops there:

File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte

Any ideas why, or how to fix this? It doesn't seem like there are any special characters in my TSV, so I'm not sure what's causing this or what to do.

brownleaf
  • Looks like Pandas is expecting a UTF-8 stream of bytes, and your file has some other (non-ASCII) encoding. Try some of the command-line utilities listed [here](https://stackoverflow.com/questions/805418/how-can-i-find-encoding-of-a-file-via-a-script-on-linux), `file -I/i`, `uchardet`, etc... I copy-pasted your sample and it looks fine, but there's probably something lost between your copy and my paste. – Zach Young Nov 12 '21 at 23:40
  • Does this answer your question? [UnicodeDecodeError when reading CSV file in Pandas with Python](https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python) – Rodalm Nov 12 '21 at 23:41
  • Thanks for the responses, yeah I just saw @ZachYoung there's one line later on in the file with an apostrophe that didn't register as utf8. I'm not sure how to convert it though (to ensure everything in the file is utf8) and I'd need to do everything in my python script rather than command-line. – brownleaf Nov 12 '21 at 23:47
  • @HarryPlotter That looked super promising but when I tried adding `encoding = "utf-8"` after (so that gives me `df = pd.read_csv(f, sep='\t', encoding = "utf-8")`) it still gave me the exact same error unfortunately – brownleaf Nov 12 '21 at 23:48
  • How did you write the TSV file? If it's a Windows machine, check its encoding. In pre-UTF days, you kinda have to know what your encoding is. – tdelaney Nov 13 '21 at 00:08
  • @tdelaney The TSV file was generated from a python script I made. I was working with a dataset and then extracted the data of interest and exported it to a tsv file, the one I'm trying to do further work on here in this question – brownleaf Nov 13 '21 at 00:13
  • @brownleaf - Use a specific encoding when you write the CSV - utf-8 is a good idea, although on Windows, there is an argument for utf-16 and a BOM (byte order mark), which may make it easier to import into Windows tools like Excel ... if you care about that! A sketch of writing with an explicit encoding is below. – tdelaney Nov 13 '21 at 03:41
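
Picking up on the comments about writing the file with a known encoding: a minimal sketch of what that could look like on the writing side, assuming the TSV is produced from a pandas DataFrame (the DataFrame contents and the output path here are placeholders):

import pandas as pd

# Placeholder data standing in for whatever the extraction script produces.
df = pd.DataFrame({
    "Tagname": ["j245qzx_8", "h833uio_7", "d423jin_2"],
    "text": ["hamburger toppings", "side of fries", "milkshake combo"],
    "category": ["f", "f", "d"],
})

# Write the TSV with an explicit encoding so whoever reads it back knows what to expect.
df.to_csv("menu_items.tsv", sep='\t', index=False, encoding='utf-8')

Reading it back with the same encoding, pd.read_csv("menu_items.tsv", sep='\t', encoding='utf-8'), then avoids guessing entirely.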

1 Answer


The fix

Just read this SO, and I think I see what's wrong.

You're getting a file handle with Python's open() and passing that to Pandas's read_csv(). open() determines the file's encoding.

So, try setting the encoding in open(), like this:

import pandas as pd

with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')
    categoryColumn = df["category"]

    categoryList = []
    for line in categoryColumn:
        categoryList.append(line)  # append to the list, not the Series

Or, don't use open() at all:

import pandas as pd

df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryColumn = df["category"]

categoryList = []
for line in categoryColumn:
    categoryList.append(line)
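
As an aside: if all you want is the values of the category column, pandas can hand them back as a plain Python list without the explicit loop. A minimal sketch, using the same placeholder filename:

df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryList = df["category"].tolist()   # ['f', 'f', 'd'] for the sample above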

Some of the back story

I echoed \x89 onto the end of your sample, then ran the chardetect utility, and it suggests the file is Windows-1252:

% echo -e '\x89' >> sample.csv

% cat sample.csv 
Tagname text    category
j245qzx_8       hamburger toppings      f
h833uio_7       side of fries   f
d423jin_2       milkshake combo d
�

% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect

% chardetect sample.csv 
sample.csv: Windows-1252 with confidence 0.73
Zach Young
  • Sadly I'm still getting the same error. Does it make a difference that I'm using a Mac instead of Windows? Edit: just refreshed the page to your updated answer, I'll try that out – brownleaf Nov 13 '21 at 00:00
  • No, OS doesn't affect Python's sense of what encoding to use. Did you try `chardetect`? What'd it report? – Zach Young Nov 13 '21 at 00:02
  • Oh wow it worked! When I tried using open() instead of pandas as you suggested, it worked perfectly, thank you :-) – brownleaf Nov 13 '21 at 00:04
  • So, that line `df = ... encoding='windows-1252')` did ***not*** work? I don't have Pandas, so I cannot test... but that's surprising. – Zach Young Nov 13 '21 at 00:05
  • Yeah it didn't work, I'm not sure why. It would've been easier to do some analysis with the data using pandas though I think (cause I only want to look at the category column) but I can still find a workaround with a nested list (unless you have any suggestions on how to extract only the category column?) – brownleaf Nov 13 '21 at 00:10
  • @brownleaf, I might have a Pandas fix. Check out latest edit. – Zach Young Nov 13 '21 at 00:14
  • Do you know if there's a way to include multiple types of encoding? The contents of my tsv file may not be the same every time I run it (my data isn't static) so sometimes it gives me a different error for a different encoding (e.g. `0x8f` which I can fix with the same logic in your post if I replace `windows-1252` with `cp850`) but because of the dynamic nature of the data I can't predict what kinds of characters/encoding (or if there are multiple kinds) that exist. Is there a way to ensure it can work every time for any set of characters/encoding? – brownleaf Nov 13 '21 at 01:20
  • You need to know what the encoding is to decode... there's nothing in Python's standard lib that I know that will guess. Still, there's that `chardetect` tool that came with my standard Python install, so you have some capabilities there to detect (a rough sketch of that is below)... but you'll have to dig and search for this. At the very least, another SO question. – Zach Young Nov 13 '21 at 04:39
  • Another thought... the difference between windows-1252 and cp850 is for the non-normal English characters. If you know or suspect the data you care about is just the normal ASCII characters, maybe pick an encoding like 8859-1 and discard any differences in the upper range. Is that whacky apostrophe (or whatever it was) that kicked off this problem and SO post really important to your process? If not, ignore it. – Zach Young Nov 13 '21 at 04:44
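
Following up on the detection idea above: the chardet library (the code behind the chardetect command-line tool) can be called from a script to guess the encoding before handing the file to pandas. A rough sketch, assuming chardet is installed and that a guess is good enough for this data; the filename here is a placeholder:

import chardet
import pandas as pd

filename = "sample.tsv"  # placeholder path

# Read the raw bytes and let chardet guess the encoding.
with open(filename, 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)               # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
encoding = guess['encoding'] or 'utf-8'   # fall back to utf-8 if chardet can't decide

df = pd.read_csv(filename, sep='\t', encoding=encoding)
categoryList = df["category"].tolist()

Keep in mind chardet only guesses; on short or unusual files it can pick the wrong encoding, which is why writing the file with a known encoding in the first place (as suggested in the comments) is the more reliable fix.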