
I have a large tab-delimited file that has been gzipped, and I would like to know how many columns it has. For small files I can just unzip and read into Python, but for large files this is slow. Is there a way to count the columns quickly without loading the whole file into Python?

Efficiently counting number of columns of text file is almost identical, but since my files are gzipped, just reading the first line won't work. Is there a way to make Python efficiently unzip just enough to read the first line?

Empiromancer

2 Answers


... but since my files are gzipped just reading the first line won't work.

Yes it will.

import csv
import gzip

with gzip.open('file.tsv.gz', 'rt') as gzf:
    reader = csv.reader(gzf, dialect=csv.excel_tab)
    print(len(next(reader)))
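To make explicit why this is fast: gzip.open returns a streaming file object, so reading one line decompresses only the leading blocks of the file, not the whole thing. A minimal self-contained sketch of the same idea without the csv module (the sample file and its 3-column contents here are illustrative):

```python
import gzip

# Create a small sample gzipped TSV for demonstration (3 columns).
with gzip.open('file.tsv.gz', 'wt') as f:
    f.write('a\tb\tc\n1\t2\t3\n')

# gzip.open streams decompression, so reading one line only
# inflates as much of the file as that line requires.
with gzip.open('file.tsv.gz', 'rt') as gzf:
    first_line = gzf.readline()

num_cols = first_line.rstrip('\n').count('\t') + 1
print(num_cols)  # → 3
```

Splitting on tabs directly is fine for a quick column count; the csv reader in the answer above is the safer choice if fields might contain quoted tabs.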
Ignacio Vazquez-Abrams

This can be done with traditional Unix command-line tools. For example:

$ zcat file.tsv.gz | head -n 1 | tr $'\t' '\n' | wc -l

zcat (or gunzip -c) decompresses to standard output without modifying the file. 'head -n 1' reads exactly one line and prints it. 'tr' replaces each tab with a newline, and 'wc -l' counts the resulting lines, which equals the number of columns. Because 'head -n 1' exits after one line, the pipeline terminates zcat early as well, so only a small prefix of the file is ever decompressed. It's quite fast. If the first line of the file is a header, simply omit the 'wc -l' to see what the headers are, one per line.
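The same "decompress only a prefix" trick can be done in pure Python with zlib, which answers the question of unzipping just enough to read the first line. A hedged sketch, assuming the header line fits in the first 64 KiB of compressed data (the filename and 4-column sample are illustrative):

```python
import gzip
import zlib

# Sample gzipped TSV for demonstration (4 columns in the header).
with open('wide.tsv.gz', 'wb') as f:
    f.write(gzip.compress(b'id\tname\tscore\tdate\nrow data follows\n'))

# Feed only the first 64 KiB of compressed bytes to a decompressor;
# wbits = MAX_WBITS | 16 tells zlib to expect a gzip header.
d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
with open('wide.tsv.gz', 'rb') as f:
    text = d.decompress(f.read(64 * 1024))

header = text.split(b'\n', 1)[0]
print(header.count(b'\t') + 1)  # → 4
```

In practice gzip.open's lazy reading (first answer) achieves the same effect with less code; the zlib version just makes the bounded read explicit.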

JonDeg