2

I have a couple of gzip files (each around 3.5 GB). At the moment I am reading them with Pandas, but it is very slow. I have also tried Dask, but it does not seem to support splitting gzip files into chunks. Is there a better way to load these massive gzip files quickly?

Dask and Pandas code:

import dask.dataframe as dd
df = dd.read_csv(r'file', sample=200000000000, compression='gzip')

I expect it to read the whole file as quickly as possible.

PeptideWitch
  • 2,239
  • 14
  • 30
  • 2
    gzip decompression is only innately parallelizable if the compressor was configured to reset its table every so often. Otherwise, one has to read from front-to-back to have the necessary state to understand the stream in-memory. See [`pigz`](https://zlib.net/pigz/) as a parallel gzip implementation (in C) providing such a compressor and decompressor; however, if you can't change the tools and settings used on the compression end, nothing you do on the decompression side will do much good. – Charles Duffy Aug 09 '19 at 00:00
  • 4
    ...and frankly, if you *can* change the tools used on the compression end, you'd be better off getting them to switch away from gzip. Take a look at the benchmarks at https://facebook.github.io/zstd/, particularly the ones for decompression. – Charles Duffy Aug 09 '19 at 00:03
  • I'm not sure about Pandas but the `gzip` module is implemented in pure python, so it's bound to be slow. Perhaps pypy could give you a speedup. At the end of the day, you're trying to uncompress several multi-gigabyte files and store their uncompressed data all in memory. That's going to take time. Consider streaming the data if possible. – Bailey Parker Aug 09 '19 at 00:12

3 Answers

4

gzip is, inherently, a pretty slow compression method and (as you say) does not support random access. This means that the only way to get to position x is to scan through the file from the start, which is why Dask does not support trying to parallelise in this case.

Your best bet, if you want to make use of parallel parsing at least, is first to decompress the whole file, so that the chunking mechanism makes sense. You could also break it into several files and compress each one, so that the total space required is similar.
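
A rough sketch of that approach, assuming the data is a plain CSV once decompressed and that a 64MB blocksize is reasonable (the file names and blocksize here are illustrative, not from the question):

import gzip
import shutil
import dask.dataframe as dd

# Decompress once up front, so the result is a plain-text file Dask can split by byte range
with gzip.open('file.csv.gz', 'rb') as src, open('file.csv', 'wb') as dst:
    shutil.copyfileobj(src, dst)

# On the uncompressed file, Dask can read chunks in parallel
df = dd.read_csv('file.csv', blocksize='64MB')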

Note that there are, in theory, some compression mechanisms that support block-wise random access, but we have not found any with sufficient community support to implement them in Dask.

The best answer, though, is to store your data in parquet or orc format, which has internal compression and partitioning.
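
As an illustration only (the file names, chunk size and the snappy compression choice are assumptions, and writing parquet needs pyarrow or fastparquet installed), the conversion can be done once with pandas, after which Dask can read the partitioned output in parallel:

import os
import pandas as pd
import dask.dataframe as dd

os.makedirs('data', exist_ok=True)

# Pay the gzip cost once: stream the CSV in chunks and write partitioned parquet files
for i, chunk in enumerate(pd.read_csv('file.csv.gz', compression='gzip', chunksize=1_000_000)):
    chunk.to_parquet(f'data/part-{i:05d}.parquet', compression='snappy')

# Later loads are parallel and can read only the columns that are needed
df = dd.read_parquet('data/')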

mdurant
  • 27,272
  • 5
  • 45
  • 74
2

One option is to use package datatable for python: https://github.com/h2oai/datatable

It can read and write files (including gzipped ones) significantly faster than pandas using the function fread, for example:

import datatable as dt
df = dt.fread('file.csv.gz')

Later, one can convert it to pandas dataframe:

df1 = df.to_pandas()

Currently datatable is only available on Linux/Mac.

rpython
  • 21
  • 2
-1

You can try using the gzip library:

import gzip
f = gzip.open('Your File', 'rb')  # 'rb' to read; 'wb' would open the file for writing
file_content = f.read()
print(file_content)

python: read lines from compressed text files
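
If the goal is to process the data rather than hold it all in memory at once, a minimal sketch of line-by-line streaming looks like this (process_line is a hypothetical placeholder for your own per-line handling):

import gzip

# 'rt' decompresses on the fly and yields text lines one at a time,
# so only a small buffer is ever held in memory
with gzip.open('Your File', 'rt') as f:
    for line in f:
        process_line(line)  # hypothetical per-line handler, not a real library call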

Kermit the Frog
  • 146
  • 2
  • 7
  • 7
    `f.read()` will load and decompress 3.5GB of on-disc data, so even if that doesn't crash your session, trying to print it all certainly will. – mdurant Aug 09 '19 at 14:51