Is it possible to read a line from a gzip-compressed text file using Python without extracting the file completely? I have a text.gz file which is around 200 MB. When I extract it, it becomes 7.4 GB. And this is not the only file I have to read: in total, I have to read 10 files. Although this will be a sequential job, I think it would be smart to do it without extracting all the information. How can this be done using Python? I need to read the text file line by line.

- Does this answer your question? [Read from a gzip file in python](https://stackoverflow.com/questions/12902540/read-from-a-gzip-file-in-python) – Michael Hall Apr 24 '23 at 00:39
4 Answers
Using gzip.GzipFile:

```python
import gzip

with gzip.open('input.gz', 'rt') as f:
    for line in f:
        print('got line', line)
```

Note: `gzip.open(filename, mode)` is an alias for `gzip.GzipFile(filename, mode)`. I prefer the former, as it looks similar to `with open(...) as f:` used for opening uncompressed files.

You could use the standard gzip module in Python. Just use `gzip.open('myfile.gz')` to open the file like any other file and read its lines. More information here: [Python gzip module](https://docs.python.org/3/library/gzip.html)
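To make that concrete, here is a minimal sketch (the file name `myfile.gz` is just a placeholder; the file is created first so the snippet is self-contained):

```python
import gzip

# Create a small gzip file to read back (placeholder name).
with gzip.open('myfile.gz', 'wt') as f:
    f.write('first line\nsecond line\n')

# Iterating the file object yields one decoded line at a time.
with gzip.open('myfile.gz', 'rt') as f:
    lines = [line.rstrip('\n') for line in f]

print(lines)  # ['first line', 'second line']
```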

- Out of curiosity, does this load the entire file into memory? Or is it smart enough to load lines as needed? – sachinruk Mar 17 '17 at 00:08
- @Sachin_ruk this doesn't load the file, it just opens it. To actually load data from the file you need to call `f.readline()` to read one line at a time, or `f.readlines(N)`, where `N` is a size hint in bytes (not a line count). – Tom Apr 18 '17 at 11:14
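To confirm the streaming behaviour described in the comments, a quick sketch (the file name `sample.gz` is invented for the demo). The file object is a lazy iterator, so only the lines you actually consume are decoded:

```python
import gzip
import itertools

# Write a file with many lines (throwaway name for illustration).
with gzip.open('sample.gz', 'wt') as f:
    for i in range(100_000):
        f.write(f'line {i}\n')

# islice stops after three lines; the remaining lines are never
# materialized as Python strings.
with gzip.open('sample.gz', 'rt') as f:
    first_three = list(itertools.islice(f, 3))

print(first_three)  # ['line 0\n', 'line 1\n', 'line 2\n']
```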
Have you tried using `gzip.GzipFile`? Its arguments are similar to `open`.

- I guess `gzip.GzipFile(file_name)` will not give the expected result, at least in Python 3+. The first thing I see in `GzipFile.__init__` in the Python 3.9 library is `if mode and ('t' in mode or 'U' in mode): raise ValueError(...)`, and its `readlines()` returns `list[bytes]`. – Victor Sergienko Aug 10 '22 at 00:38
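Following up on that caveat: in Python 3, `gzip.GzipFile` accepts only binary modes, so one way to still get text lines from it is to wrap it in `io.TextIOWrapper`. A sketch (`data.gz` is a placeholder name, created here so the snippet runs on its own):

```python
import gzip
import io

# Prepare a small gzip file (placeholder name).
with gzip.open('data.gz', 'wt') as f:
    f.write('hello\nworld\n')

# GzipFile yields bytes; TextIOWrapper decodes them to str on the fly.
with gzip.GzipFile('data.gz') as raw, \
        io.TextIOWrapper(raw, encoding='utf-8') as text:
    lines = text.readlines()

print(lines)  # ['hello\n', 'world\n']
```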
The gzip library (obviously) uses gzip compression, which can be a bit slow. You can speed things up with a system call to pigz, the parallelized version of gzip. The downsides are that you have to install pigz and it will use more cores during the run, but it is much faster and not more memory intensive. The call to the file then becomes `os.popen('pigz -dc ' + filename)` instead of `gzip.open(filename, 'rt')`. The pigz flags are `-d` for decompress and `-c` for stdout output, which can then be grabbed by `os.popen`.
The following code takes in a file and a number (1 or 2) and counts the number of lines in the file with the different calls, while measuring the time the code takes. Defining the following code in `unzip-file.py`:

```python
#!/usr/bin/python
import os
import sys
import time
import gzip

def local_unzip(obj):
    t0 = time.time()
    count = 0
    with obj as f:
        for line in f:
            count += 1
    print(time.time() - t0, count)

r = sys.argv[1]
if sys.argv[2] == "1":
    local_unzip(gzip.open(r, 'rt'))
else:
    local_unzip(os.popen('pigz -dc ' + r))
```
Calling these using `/usr/bin/time -f %M`, which measures the maximum memory usage of the process, on a 28G file we get:

```
$ /usr/bin/time -f %M ./unzip-file.py $file 1
(3037.2604110240936, 1223422024)
5116
$ /usr/bin/time -f %M ./unzip-file.py $file 2
(598.771901845932, 1223422024)
4996
```
This shows that the system call is about five times faster (10 minutes compared to 50 minutes) at essentially the same maximum memory usage. It is also worth noting that, depending on what you do per line, reading the file might not be the limiting factor, in which case the option you take does not matter.
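As a side note, `subprocess` is usually preferred over `os.popen` for this, since passing an argument list avoids shell quoting issues with odd file names. A sketch under that assumption (the `program` parameter is a hypothetical knob: plain `gzip` also accepts `-dc`, so it works as a drop-in when pigz is not installed):

```python
import subprocess

def gz_lines(filename, program='pigz'):
    """Yield decoded lines from a gzip file via an external decompressor."""
    # -d: decompress, -c: write to stdout; text=True yields str lines.
    proc = subprocess.Popen([program, '-dc', filename],
                            stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()
```

It is used the same way as the file objects above: `for line in gz_lines(filename): ...`.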
