
I have a large tab-delimited file that has been gzipped, and I would like to know how many columns it has. For small files I can just unzip and read into Python, but for large files this is slow. Is there a way to count the columns quickly without loading the whole file into Python?

Efficiently counting number of columns of text file is almost identical, but since my files are gzipped, just reading the first line won't work. Is there a way to make Python efficiently unzip just enough to read the first line?

Empiromancer

2 Answers


... but since my files are gzipped just reading the first line won't work.

Yes it will.

import csv
import gzip

with gzip.open('file.tsv.gz', 'rt') as gzf:
    reader = csv.reader(gzf, dialect=csv.excel_tab)
    print(len(next(reader)))
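To make explicit why this is fast: gzip.open returns a streaming file object, so reading one line decompresses only the leading blocks of the file, not the whole thing. A minimal self-contained sketch of the same idea without the csv module (the sample file and its 3-column contents here are illustrative):

```python
import gzip

# Create a small sample gzipped TSV for demonstration (3 columns).
with gzip.open('file.tsv.gz', 'wt') as f:
    f.write('a\tb\tc\n1\t2\t3\n')

# gzip.open streams decompression, so reading one line only
# inflates as much of the file as that line requires.
with gzip.open('file.tsv.gz', 'rt') as gzf:
    first_line = gzf.readline()

num_cols = first_line.rstrip('\n').count('\t') + 1
print(num_cols)  # → 3
```

Splitting on tabs directly is fine for a quick column count; the csv reader in the answer above is the safer choice if fields might contain quoted tabs.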
Ignacio Vazquez-Abrams

This can be done with traditional Unix command-line tools. For example:

$ zcat file.tsv.gz | head -n 1 | tr $'\t' '\n' | wc -l

zcat (or gunzip -c) decompresses to standard output without modifying the file. 'head -n 1' reads exactly one line and prints it. 'tr' replaces each tab with a newline, and 'wc -l' counts the resulting lines, which equals the number of columns. Because 'head -n 1' exits after one line, the pipeline terminates zcat early as well, so only a small prefix of the file is ever decompressed. It's quite fast. If the first line of the file is a header, simply omit the 'wc -l' to see what the headers are, one per line.
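The same "decompress only a prefix" trick can be done in pure Python with zlib, which answers the question of unzipping just enough to read the first line. A hedged sketch, assuming the header line fits in the first 64 KiB of compressed data (the filename and 4-column sample are illustrative):

```python
import gzip
import zlib

# Sample gzipped TSV for demonstration (4 columns in the header).
with open('wide.tsv.gz', 'wb') as f:
    f.write(gzip.compress(b'id\tname\tscore\tdate\nrow data follows\n'))

# Feed only the first 64 KiB of compressed bytes to a decompressor;
# wbits = MAX_WBITS | 16 tells zlib to expect a gzip header.
d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
with open('wide.tsv.gz', 'rb') as f:
    text = d.decompress(f.read(64 * 1024))

header = text.split(b'\n', 1)[0]
print(header.count(b'\t') + 1)  # → 4
```

In practice gzip.open's lazy reading (first answer) achieves the same effect with less code; the zlib version just makes the bounded read explicit.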

JonDeg