
I have a ".csv.gz" file that is 100 GB in size on a remote Linux machine. I definitely do not want to decompress it, because the uncompressed size would reach about 1 TB.

I have been looking online for ways to read such a file, and I saw a suggestion here:

python: read lines from compressed text files

gzip? pandas? iterator?

My mentor suggested piping the data after unzipping it.

I also need to consider memory, so readlines() is definitely not an option.

I wonder if anyone has an optimal solution for this, because the file is really large and even simple operations take a lot of time.
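For reference, the accepted answer in that link boils down to something like this (a minimal sketch; `process_line` is a placeholder for whatever per-line work is needed):

import gzip

# Stream the file line by line; gzip decompresses on the fly, so only one
# decoded line is held in memory at a time.
with gzip.open("100GB.csv.gz", "rt") as f:  # "rt" yields text lines
    for line in f:
        process_line(line)  # placeholder for the actual per-line processing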

momo
  • What's wrong with the [accepted answer](https://stackoverflow.com/a/30868178/2653663) in the link you gave? That seems to read the compressed file line by line, so memory shouldn't be an issue. – user2653663 May 10 '19 at 15:28
  • Nothing wrong. I am just looking to see if there is a better solution. – momo May 10 '19 at 15:30
  • @SueX Please explain what is not good enough about that answer – why do you need a better solution? Using `gzip.open()` and iterating over the file handle is the most obvious, idiomatic way to do this in Python. – Chris_Rands May 10 '19 at 15:31
  • @SueX Your question would be better overall and get more attention if you followed the procedure you linked to and edited your question with details on why it fails for you, or provided timings for that solution and made it clear that you are looking for faster approaches. – user2653663 May 10 '19 at 15:34

2 Answers


You can pipe the decompressed data into your Python script and process it line by line with `for line in sys.stdin: ...`:

zcat 100GB.csv.gz | python <my-app>
Maxim Egorushkin
  • What is the value in taking the decompression step outside of Python here? It seems unnecessary and convoluted. – Chris_Rands May 10 '19 at 15:39
  • @Chris_Rands IMO, what you suggest requires writing more code for no benefit. Python code may not care how your file is compressed, only that it comes into `stdin` uncompressed. In other words, your suggestion requires every piece of software to be able to handle different compression formats, which doesn't seem to be a good idea at all. – Maxim Egorushkin May 10 '19 at 15:51
  • It is trivial to adapt the Python code to handle both compressed and non-compressed input (see the sketch after this thread). Anyway, your solution isn't cross-platform: there is no zcat on Windows. – Chris_Rands May 10 '19 at 18:53
  • @Chris_Rands Windows supports Linux applications natively: https://youtu.be/lwhMThePdIo – Maxim Egorushkin May 12 '19 at 16:46
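The adaptation Chris_Rands describes could look roughly like this (a sketch; `open_maybe_compressed` is a made-up helper name):

import gzip
import sys

def open_maybe_compressed(path):
    # Made-up helper: use gzip.open for ".gz" files and plain open otherwise,
    # so the same line-by-line loop works on both kinds of input.
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

with open_maybe_compressed(sys.argv[1]) as f:
    for line in f:
        ...  # per-line processing goes here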

Read the lines one by one by doing:

import sys

for line in sys.stdin:          # iterate over the decompressed lines fed through the pipe
    do_sth_with_the_line(line)  # placeholder for your per-line processing

You call this Python script with:

zcat 100GB.csv.gz | python python_script.py

user2653663
  • Hi, thanks so much for the quick answer. Just wondering, what's the difference between this and using gzip (the answer in the link)? Are they similar? I am very new to large datasets. – momo May 10 '19 at 15:41
  • My solution reads uncompressed data, which is fed in by the zcat call. The gzip answer in the link works on the compressed data directly and doesn't need a separate zcat call. – Bernhard Hering May 10 '19 at 15:43
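Since the question also mentions pandas: pandas can read a gzipped CSV in fixed-size chunks directly, which also keeps memory bounded (a sketch; the chunk size of one million rows is arbitrary):

import pandas as pd

# pandas infers gzip compression from the ".gz" suffix; chunksize makes
# read_csv yield DataFrames of at most one million rows each.
for chunk in pd.read_csv("100GB.csv.gz", chunksize=1_000_000):
    ...  # process each DataFrame chunk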