
I am trying to process a series of .gz (gzipped) files. I believe they were reading successfully earlier, when I first started debugging other parts of the code, but I can't be certain. I switched to an uncompressed test file so I could see what was causing some of the type conversions to fail. Once I had that debugged and went back to processing the real gzipped files, I started getting errors. I would appreciate any ideas on what the problem might be and/or how to go about investigating it further.

I have stripped it down to the following code:

#!/usr/bin/env python3

import numpy as np
import pandas as pd

filename = './small_test.csv.gz'

names = ['string_var','int_var','float_var','date_var']
types = {'string_var': 'string','int_var':'int64','float_var':'float64','date_var':'string'}
with open(filename) as csvfile:
    print(filename)
#    df = pd.read_csv(csvfile,names=names,header=0,dtype=types)
#    df = pd.read_csv(csvfile,compression='gzip')
    df = pd.read_csv(csvfile)
    print(df.info(verbose=True))

I have tried just specifying the file and defaulting everything, specifying the file and the compression, and doing what I really need to do, which is specifying the names and types as well. I have also tried all those combinations on my full data set. They all fail in the same way with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I found other questions on Stack Overflow suggesting it was an encoding problem. The file has the proper .gz extension that read_csv uses to infer the compression, and I also specified the compression explicitly. The stack trace (below) shows it is getting into the gzip routine. The file -I command properly identifies the compressed file as gzip (small_test.csv.gz: application/x-gzip; charset=binary) and the text file as ASCII (small_test.csv: text/plain; charset=us-ascii), so that doesn't appear to be the problem.

Based on the above, I also tried encoding='ascii' and encoding='us-ascii'. They failed in the same way.

Another question involved a file that didn't have the .gz extension, so it was gzipped but was being read as uncompressed; that is not my issue. If I unzip the file, it works fine. If I rezip it, it fails. gzcat and gzip work just fine on all the files, so I don't think it is a corruption issue.
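For anyone wanting to repeat the file-type check from Python rather than with the file command, here is a minimal sketch. The helper name looks_gzipped and the temporary file names are made up for illustration; the only fact relied on is that a gzip stream starts with the two bytes 0x1f 0x8b.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number.

    Hypothetical helper for illustration; it is not part of pandas or gzip.
    """
    with open(path, "rb") as fh:  # binary mode, so no text decoding happens
        return fh.read(2) == GZIP_MAGIC


# Build a throwaway gzipped CSV and a plain CSV to exercise the helper.
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "small_test.csv.gz")
csv_path = os.path.join(tmpdir, "small_test.csv")
payload = b'"string_var","int_var"\na,1\n'

with gzip.open(gz_path, "wb") as fh:
    fh.write(payload)
with open(csv_path, "wb") as fh:
    fh.write(payload)

print(gz_path, looks_gzipped(gz_path))
print(csv_path, looks_gzipped(csv_path))
```

Note that the check only works because the file is opened with 'rb'; the same two bytes read through a text-mode handle would have to pass through a decoder first.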

In case it is useful, here is the test file:

"string_var","int_var","float_var","date_var"
a,1,1.0,"2020-01-01 21:20:19"
b,2,2.0,"2019-10-31 00:00:00"
c,3,3.0,"1969-06-22 12:00:00"

And finally, this is the entire stack trace:

Traceback (most recent call last):
  File "./test_read_csv.py", line 14, in <module>
    df = pd.read_csv(csvfile,compression='gzip',encoding='us-ascii')
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 719, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2062, in pandas._libs.parsers.raise_parser_error
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 406, in _read_gzip_header
    magic = self._fp.read(2)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 91, in read
    self.file.read(size-self._length+read)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
William Allcock
  • try encoding='latin-1' – trigonom Mar 01 '20 at 19:03
  • or encoding='utf16', or encoding = 'unicode_escape' otherwise you need to open it as a file and continue from here https://stackoverflow.com/questions/42339876/error-unicodedecodeerror-utf-8-codec-cant-decode-byte-0xff-in-position-0-in/42340744 – trigonom Mar 01 '20 at 19:15
  • I am on business travel, so I probably won't be able to do this until tomorrow. I will try it, but just so I understand: the error says UTF-8 and this is plain ASCII, which is valid UTF-8. I have already moved on to testing my code on the uncompressed version of the file, and I can just try something and go with it if it works, but I don't understand why the encoding would be an issue on plain ASCII text? – William Allcock Mar 02 '20 at 16:10
  • no idea, I had the same issue with ASCII files which had unreadable chars and encoding problems – trigonom Mar 02 '20 at 21:20
  • I tried all the suggested encodings and nothing worked. I believe this is a bug. If I gunzip the files at the command line, it processes them just fine; when I rezip them, it fails. This might just be coincidence, but per [this question](https://stackoverflow.com/questions/44659851/unicodedecodeerror-utf-8-codec-cant-decode-byte-0x8b-in-position-1-invalid/44660123) 0x1f 0x8b is the magic number for a gzipped file. I tried that and also got the Unicode decode error. I appreciate the help. – William Allcock Mar 03 '20 at 13:28

2 Answers


Well, after digging through the Pandas code with a ton of help from my colleague, we figured this out. Here is the short version: If you want to open a gzipped file and pass it to read_csv(), you have to open it in binary AND specify the compression:

with open(filename, 'rb') as csvfile:
    df = pd.read_csv(csvfile,compression='gzip')

Letting read_csv() do the open also works: read_csv(filename) #filename is a string ending in .gz

The primary problem is that I did not open the file in binary. Since I did not, csvfile had a default encoding of UTF-8. So, here are the scenarios:

with open(filename) as csvfile: # Not binary

  • read_csv(csvfile): Pandas uses a text parser, which fails because the file is gzipped
  • read_csv(csvfile, compression='gzip'): This is what I worked on most. It did get down into gzip (which was what was so confusing) and called _read_gzip_header, but since the file handle was opened in text mode with UTF-8 decoding, reading the gzip magic bytes went through the text decoder, which failed on byte 0x8b.

with open(filename, 'rb') as csvfile:

  • read_csv(csvfile): This still fails. This time it fails because the default for compression is 'infer', BUT if you read the doc closely, 'infer' only works if it is "path like". It infers based on the file extension which it didn't have because it was passed a file handle, not a string representation of the path. This ends up being identical to the read_csv(csvfile) case above when it wasn't opened in binary.

  • read_csv(csvfile, compression='gzip'): This is what works. The file is binary, so doesn't use a UTF reader and it is explicitly told that it is gzipped so it calls the gzip library
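The text-mode versus binary-mode difference can be reproduced without pandas. The sketch below assumes, as the stack trace in the question suggests, that the open handle ends up wrapped in gzip.GzipFile; the helper gunzip_from_handle and the temporary file are illustrative only.

```python
import gzip
import os
import tempfile

# Create a small gzipped CSV to read back.
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "small_test.csv.gz")
payload = b"string_var,int_var\na,1\nb,2\n"
with gzip.open(gz_path, "wb") as fh:
    fh.write(payload)


def gunzip_from_handle(handle):
    """Decompress everything from an already-open file handle (illustrative helper)."""
    return gzip.GzipFile(fileobj=handle).read()


# Text mode: gzip asks the handle for the two magic bytes, the handle tries
# to decode them as UTF-8 first, and 0x8b is not a valid start byte.
text_mode_error = None
with open(gz_path, encoding="utf-8") as text_handle:
    try:
        gunzip_from_handle(text_handle)
    except UnicodeDecodeError as exc:
        text_mode_error = exc
print("text mode:", repr(text_mode_error))

# Binary mode: gzip receives the raw bytes and decompresses normally.
with open(gz_path, "rb") as binary_handle:
    binary_result = gunzip_from_handle(binary_handle)
print("binary mode:", binary_result)
```

This is why changing the encoding argument to read_csv makes no difference: the decoding that fails belongs to the handle returned by open(), before any decompression has happened.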

William Allcock

I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing encoding="utf-8" with encoding = "ISO-8859-1" will solve the problem.

df = pd.read_csv(path_to_csv_or_csv_gz, encoding="ISO-8859-1")  # path to a .csv or .csv.gz file
cthemudo