then filter the content
You didn't describe much about the filtering operation.
I will assume you seek a needle.
Structure your Python program so it accepts some limited number of CSV lines on sys.stdin, and use a bash pipeline like this:
$ zcat < giant_haystack.csv.gz | grep NEEDLE | python my_prog.py
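As a rough idea of what the receiving end of that pipeline could look like, here is a minimal sketch of my_prog.py. The summarize function and its row count are placeholders I made up; substitute whatever analysis you actually need.

```python
import csv
import sys


def summarize(stream):
    """Count the non-empty CSV rows arriving on a text stream (placeholder analysis)."""
    return sum(1 for row in csv.reader(stream) if row)


if __name__ == "__main__" and not sys.stdin.isatty():
    # Only read when input is piped in, as in the zcat | grep pipeline.
    print(f"received {summarize(sys.stdin)} rows")
```

Since grep has already thrown away the non-matching lines, the program only ever sees the small subset.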
Or, move the filtering into your program, perhaps with a regex (import re) that looks for NEEDLE and discards non-matching lines.
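A sketch of that in-program filter might look like the following. The name matching_lines is hypothetical, and NEEDLE is still just a stand-in for whatever pattern you are really after.

```python
import re

# NEEDLE is a placeholder pattern; replace it with your real search term.
NEEDLE = re.compile(r"NEEDLE")


def matching_lines(lines, pattern=NEEDLE):
    """Yield only the lines in which the pattern occurs, discarding the rest."""
    for line in lines:
        if pattern.search(line):
            yield line
```

You would then loop over matching_lines(sys.stdin) instead of sys.stdin, and the grep stage can be dropped from the pipeline. Because it is a generator, only one line is held in memory at a time.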
Or teach your python app about decompressing gzip inputs.
Or consume filesystem space with $ gunzip giant_haystack.csv.gz
and have python read-and-retain a subset of those lines.
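If you would rather have Python do the decompressing itself, the standard gzip module can stream the file without writing an uncompressed copy to disk. This is only a sketch; the function name matching_lines_gz and the UTF-8 encoding are my assumptions.

```python
import gzip
import re


def matching_lines_gz(path, pattern):
    """Yield the lines of a gzip-compressed text file that match a regex."""
    regex = re.compile(pattern)
    # Mode "rt" decompresses on the fly and decodes to text, one line at a time.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if regex.search(line):
                yield line
```

Something like list(matching_lines_gz("giant_haystack.csv.gz", "NEEDLE")) would then retain just the matching subset.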
Suppose the regex ,Virus,.*,Type, matches the initial header line of your CSV.
You can use wildcards or brace expansion to name
a bunch of compressed input files:
HDR=",Virus,.*,Type,"
NEEDLE="COVID|Alpha"
for FILE in giant_haystack{1,2,3}.csv.gz
do
    zegrep "${HDR}|${NEEDLE}" < "$FILE" > small.csv
    python my_prog.py small.csv
done
Now your program is in a good position to read a smallish dataset and produce analytic results. The input filename is available as sys.argv[1].
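For the file-based variant, my_prog.py could be sketched roughly like this. The load_rows helper is hypothetical, and it assumes the header line survived the filtering so that csv.DictReader can use it for column names.

```python
import csv
import sys


def load_rows(filename):
    """Read a small CSV into a list of dicts keyed by the header line."""
    with open(filename, newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is the filename passed on the command line, e.g. small.csv
    rows = load_rows(sys.argv[1])
    print(f"{sys.argv[1]}: {len(rows)} data rows")
```

Loading everything into a list is reasonable here only because the filtering already shrank the data.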