I am trying to process a series of gzipped (.gz) files. I believe they were being read successfully when I first started debugging other parts of the code, but I can't be certain. I switched to an uncompressed test file so I could see what was causing some of the type conversions to fail. Once that was debugged and I went back to processing the real gzipped files, I started getting errors. I would appreciate any ideas on what the problem might be and/or how to investigate it further.
I have stripped it down to the following code:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
filename = './small_test.csv.gz'
names = ['string_var','int_var','float_var','date_var']
types = {'string_var': 'string','int_var':'int64','float_var':'float64','date_var':'string'}
with open(filename) as csvfile:
    print(filename)
    # df = pd.read_csv(csvfile,names=names,header=0,dtype=types)
    # df = pd.read_csv(csvfile,compression='gzip')
    df = pd.read_csv(csvfile)
print(df.info(verbose=True))
I have tried just specifying the file and defaulting everything, specifying the file and the compression, and doing what I really need to do, which is specifying the names and types as well. I have also tried all those combinations on my full data set. They all fail in the same way with the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I found other questions on Stack Overflow suggesting it was an encoding problem. The file has the proper .gz extension that read_csv uses to infer the compression, and I also specified compression='gzip' explicitly. The stack trace (below) shows it is getting into the gzip routine. The file -I command properly identifies the compressed file as gzip:
small_test.csv.gz: application/x-gzip; charset=binary
and the text file as ASCII:
small_test.csv: text/plain; charset=us-ascii
so that doesn't appear to be the problem.
Based on the above, I also tried encoding='ascii' and encoding='us-ascii'. They failed in the same way.
Another question involved a file without the .gz extension, so it was gzipped but was being read as uncompressed; that is not my issue. If I unzip the file, it works fine. If I gzip it again, it fails. gzcat and gzip work just fine on all the files, so I don't think it is a corruption issue.
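The same checks can be repeated from inside Python with only the standard library. This is a rough sketch, not a diagnosis: the helper names (looks_like_gzip, read_gzip_text, text_mode_error) and the temporary demo file are made up for illustration. It relies on the fact that a gzip stream starts with the two bytes 0x1f 0x8b, the second of which is the byte named in the error message.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Read the first two raw bytes and compare them to the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def read_gzip_text(path, encoding="ascii"):
    """Decompress with the gzip module and decode the result as text."""
    with gzip.open(path, "rt", encoding=encoding) as f:
        return f.read()


def text_mode_error(path):
    """Try to read the compressed bytes as UTF-8 text; return the error message, if any."""
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Hypothetical demo file standing in for small_test.csv.gz.
sample = 'a,1,1.0,"2020-01-01 21:20:19"\n'
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="ascii") as f:
    f.write(sample)

is_gzip = looks_like_gzip(demo_path)
round_trip = read_gzip_text(demo_path)
error_message = text_mode_error(demo_path)
print(is_gzip, repr(round_trip), error_message)
```

If the first two helpers succeed on the real files while the third reports a decode error at byte 0x8b, the files themselves are probably fine and the question becomes how the bytes are being handed to the reader.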
In case it is useful, here is the test file:
"string_var","int_var","float_var","date_var"
a,1,1.0,"2020-01-01 21:20:19"
b,2,2.0,"2019-10-31 00:00:00"
c,3,3.0,"1969-06-22 12:00:00"
And finally, this is the entire stack trace:
Traceback (most recent call last):
  File "./test_read_csv.py", line 14, in <module>
    df = pd.read_csv(csvfile,compression='gzip',encoding='us-ascii')
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 719, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2062, in pandas._libs.parsers.raise_parser_error
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 406, in _read_gzip_header
    magic = self._fp.read(2)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 91, in read
    self.file.read(size-self._length+read)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
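The last frames of the trace are in codecs.py, reached from gzip's read of the underlying file object. That pattern is consistent with the handle returned by open(filename) being a text-mode handle, so the raw gzip bytes are decoded as UTF-8 before gzip ever sees them. Below is a minimal sketch of two alternatives to compare against, assuming that reading of the trace is right: passing the path straight to read_csv, or opening the file with mode 'rb'. The temporary file is a hypothetical stand-in for small_test.csv.gz, and behavior may differ between pandas versions.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for small_test.csv.gz, written to a temporary directory.
csv_text = (
    '"string_var","int_var","float_var","date_var"\n'
    'a,1,1.0,"2020-01-01 21:20:19"\n'
    'b,2,2.0,"2019-10-31 00:00:00"\n'
    'c,3,3.0,"1969-06-22 12:00:00"\n'
)
path = os.path.join(tempfile.mkdtemp(), "small_test.csv.gz")
with gzip.open(path, "wt", encoding="ascii") as f:
    f.write(csv_text)

# Option 1: pass the path itself; read_csv infers gzip from the .gz extension.
df_from_path = pd.read_csv(path)

# Option 2: pass a handle opened in binary mode and name the compression explicitly,
# since inference from the extension only applies to paths.
with open(path, "rb") as handle:
    df_from_handle = pd.read_csv(handle, compression="gzip")

df_from_path.info(verbose=True)
print(df_from_path.equals(df_from_handle))
```

If either option reads the real files, the names, header and dtype arguments from the original script can be added back one at a time to confirm they are not involved.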