dealing with a list of big .txt.gz files

Question

I'm new to working with large compressed data, so I have some questions about it.

How can I read a list of big (vep.txt.gz) files in R or Python, then filter the content of the files based on two columns?

Should I decompress the files or just read them?

In Python, I ran into a memory error because the files are very big.

Thanks in advance

You should decide for one language (remove the other tag) and describe what "filter the content" means (because the right answer depends on this). — Michael Butscher, Jan 15 '23 at 22:31
I want to read multiple files then filter the content based on two columns (Two values) for example Virus: COVID & Type: Alpha, so I want to retrieve some of the rows then save them as a CSV file — Joman, Jan 15 '23 at 22:46
This type of question is better suited to https://bioinformatics.stackexchange.com/ @Maj — jared_mamrot, Jan 15 '23 at 22:51
Iterate over the lines of each file in sequence; if a line meets your criteria write the info to a separate file. — wwii, Jan 15 '23 at 22:58

J_H · Answer 1 · 2023-01-15T23:00:06.683

then filter the content

You didn't describe much about the filtering operation. I will assume you seek a needle.

Structure your Python program so it accepts a limited number of CSV lines on sys.stdin, and use a bash pipeline like this:

$ zcat < giant_haystack.csv.gz | grep NEEDLE | python my_prog.py

Or move the filtering into your program, perhaps with an import re regex that looks for NEEDLE and discards non-matching lines. Or teach your Python app about decompressing gzip inputs. Or consume filesystem space with $ gunzip giant_haystack.csv.gz and have Python read and retain a subset of those lines.
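As a rough illustration of the "decompress inside Python" option, here is a minimal sketch. It assumes comma-separated input with a header row containing columns named Virus and Type (taken from the asker's comment), and that all input files share the same header. The function name filter_gz and its parameters are hypothetical, not something from the question. gzip.open in text mode decompresses on the fly, so only one line at a time is held in memory.

```python
import csv
import gzip


def filter_gz(paths, out_path, wanted, delimiter=","):
    """Stream rows from gzip-compressed delimited files, keeping matches.

    `wanted` maps column name -> required value,
    e.g. {"Virus": "COVID", "Type": "Alpha"}.
    Assumes every input file has the same header row.
    Returns the number of data rows written.
    """
    kept = 0
    writer = None
    with open(out_path, "w", newline="") as out:
        for path in paths:
            # "rt" yields decompressed text lines one at a time.
            with gzip.open(path, "rt", newline="") as fh:
                reader = csv.DictReader(fh, delimiter=delimiter)
                if writer is None and reader.fieldnames:
                    # Write the header once, taken from the first file.
                    writer = csv.DictWriter(
                        out, fieldnames=reader.fieldnames, delimiter=delimiter
                    )
                    writer.writeheader()
                for row in reader:
                    if all(row.get(c) == v for c, v in wanted.items()):
                        writer.writerow(row)
                        kept += 1
    return kept
```

If the .vep.txt.gz files are tab-separated, passing delimiter="\t" should be enough; lines starting with "##" or other preamble would need to be skipped first, which this sketch does not attempt.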

Suppose the regex ,Virus,.*,Type, matches the initial header line of your CSV.

You can use wildcards or brace expansion to name a bunch of compressed input files:

HDR=",Virus,.*,Type,"
NEEDLE="COVID|Alpha"

for FILE in giant_haystack{1,2,3}.csv.gz
do
    zegrep "${HDR}|${NEEDLE}" < "$FILE" > small.csv
    python my_prog.py small.csv
done

Now your program is in a good position to read a smallish dataset and produce analytic results. The input filename is available as sys.argv[1].
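Note that the grep prefilter keeps any line containing COVID or Alpha anywhere, so the program still has to check the two columns exactly. A minimal sketch of what my_prog.py could look like follows; the column names Virus and Type come from the asker's comment, while the helper name matching_rows and the choice to print the result as CSV on stdout are assumptions for illustration.

```python
import csv
import sys


def matching_rows(lines, virus="COVID", type_="Alpha"):
    """Yield rows whose Virus and Type columns both match exactly.

    `lines` is any iterable of CSV text lines, header first.
    """
    for row in csv.DictReader(lines):
        if row.get("Virus") == virus and row.get("Type") == type_:
            yield row


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is the prefiltered file, e.g. small.csv.
    with open(sys.argv[1], newline="") as fh:
        rows = list(matching_rows(fh))
    if rows:
        writer = csv.DictWriter(sys.stdout, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Inside the loop it could be invoked as python my_prog.py small.csv >> filtered.csv, keeping in mind that each run would then emit its own header line.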

I want to read multiple files then filter the content based on two columns (Two values) for example Virus: COVID & Type: Alpha, so I want to retrieve some of the rows then save them as a CSV file — Joman, Jan 15 '23 at 22:45
What did you mean by (for FILE in giant_haystack{1,2,3}.csv.gz)? The file's extension is .vep.txt.gz. @J_H — Joman, Jan 17 '23 at 13:00
