
I'm new to working with large compressed data, so I have a few questions about it.

How can I read a list of large (vep.txt.gz) files in R or Python and then filter their contents based on two columns?

Should I decompress the files first, or can I read them as they are?

In Python, I ran into a memory error because the files are very large.

Thanks in advance

Joman
  • You should decide for one language (remove the other tag) and describe what "filter the content" means (because the right answer depends on this). – Michael Butscher Jan 15 '23 at 22:31
  • I want to read multiple files then filter the content based on two columns (Two values) for example Virus: COVID & Type: Alpha, so I want to retrieve some of the rows then save them as a CSV file – Joman Jan 15 '23 at 22:46
  • This type of question is better suited to https://bioinformatics.stackexchange.com/ @Maj – jared_mamrot Jan 15 '23 at 22:51
  • Iterate over the lines of each file in sequence; if a line meets your criteria write the info to a separate file. – wwii Jan 15 '23 at 22:58

1 Answer


"then filter the content"

You didn't describe the filtering operation in much detail. I will assume you are looking for a needle in a haystack: a small number of matching rows in a very large file.

Structure your Python program so it accepts a limited number of CSV lines on sys.stdin, and use a bash pipeline like this:

$ zcat < giant_haystack.csv.gz | grep NEEDLE | python my_prog.py

Or move the filtering into your program, perhaps with a regex (import re) that looks for NEEDLE and discards non-matching lines. Or teach your Python app to decompress gzip input itself; the standard-library gzip module can do this. Or spend filesystem space on $ gunzip giant_haystack.csv.gz and have Python read the uncompressed file, retaining only a subset of its lines.
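As a rough sketch of the "teach your Python app about gzip" option, the function below streams one compressed file row by row, so memory use stays small regardless of file size. The function name filter_gz, the tab delimiter, and the column names Virus and Type are assumptions for illustration (the column names come from the asker's comment); adjust them to match the real files.

```python
import csv
import gzip


def filter_gz(src_path, dst_path, wanted, delimiter="\t"):
    """Stream a gzip-compressed delimited file and keep matching rows.

    `wanted` maps column names to required values, e.g.
    {"Virus": "COVID", "Type": "Alpha"} (hypothetical column names).
    Returns the number of rows written to `dst_path`.
    """
    kept = 0
    # "rt" decompresses on the fly and yields text, one line at a time.
    with gzip.open(src_path, "rt", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter=delimiter)
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames or [], extrasaction="ignore"
        )
        writer.writeheader()
        for row in reader:
            # A row is kept only if every requested column has the wanted value.
            if all(row.get(col) == val for col, val in wanted.items()):
                writer.writerow(row)
                kept += 1
    return kept
```

Calling filter_gz("sample.vep.txt.gz", "sample_filtered.csv", {"Virus": "COVID", "Type": "Alpha"}) would write only the rows where both columns match; the file names here are placeholders.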


Suppose the regex ,Virus,.*,Type, matches the initial header line of your CSV.

You can use wildcards or brace expansion to name a bunch of compressed input files:

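HDR=",Virus,.*,Type,"    # regex that matches the header line
NEEDLE="COVID|Alpha"     # matches lines containing either value

for FILE in giant_haystack{1,2,3}.csv.gz
do
    # Keep the header plus candidate rows; small.csv is overwritten for each input.
    zegrep "${HDR}|${NEEDLE}" < "$FILE" > small.csv
    python my_prog.py small.csv
done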

Now your program is in a good position to read a smallish dataset and produce analytic results. The input filename is available as sys.argv[1].
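Note that a pattern like COVID|Alpha keeps any line containing either word anywhere, so it works as a coarse prefilter rather than an exact two-column match. A minimal sketch of what my_prog.py might look like follows, assuming small.csv is comma-separated with header columns named Virus and Type (both names, and the helper matching_rows, are hypothetical):

```python
import csv
import sys


def matching_rows(lines, virus="COVID", type_="Alpha"):
    """Yield rows whose Virus and Type columns both match exactly.

    `lines` is any iterable of CSV text lines, header first.
    """
    for row in csv.DictReader(lines):
        if row.get("Virus") == virus and row.get("Type") == type_:
            yield row


def main(path):
    # The prefiltered file is small, so holding the matches in a list is fine.
    with open(path, newline="") as f:
        rows = list(matching_rows(f))
    print(f"{len(rows)} matching rows in {path}")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Splitting the work this way lets grep discard most of the data cheaply while Python applies the precise column test to what remains.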

J_H
  • What did you mean by (for FILE in giant_haystack{1,2,3}.csv.gz)? My files' extension is .vep.txt.gz. @J_H – Joman Jan 17 '23 at 13:00