I have 23 gzipped files of genetic data, with between 3.8 and 24 million rows each. Each file has over 12,000 columns. I need to extract the rows where a variable in a particular column is above a certain value.
It's easy to do this by piping the file (let's call it ${HUGE_DATA_FILE}) from zcat to awk and printing the lines that meet the condition to a temporary file that is gzipped at the end. However, maybe 40% of the lines meet the condition, and the temporary file becomes immense. If I try this with multiple files in parallel, the non-gzipped temporary files rapidly take up all the available memory.
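For concreteness, the simple version looks something like this (column 7 and the 0.4 threshold are the same ones used in the script below):

```sh
# One pass: keep rows whose 7th column is at least 0.4, then compress the result.
zcat "${HUGE_DATA_FILE}" | awk -F'\t' '$7 >= 0.4' > "${TEMP_OUTPUT_FILE}"
gzip "${TEMP_OUTPUT_FILE}"
```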
I wrote a script that processes the file in blocks: it reads through 100,000 rows of data, extracting the appropriate lines to a temporary file, then gzips the temporary file and appends it to an output file. The output is correct, but getting there is slow. Every time the loop starts on a new block (line 6), it begins reading ${HUGE_DATA_FILE} from the beginning, which seems like a real waste of time.
1 BLOCK_SIZE=100000
2 START_CTR=1
3 END_CTR=$(( START_CTR + BLOCK_SIZE ))
4 while [ $START_CTR -lt $MAX_LINE ]
5 do
6 zcat ${HUGE_DATA_FILE} | tail -n +${START_CTR} | head -n ${BLOCK_SIZE} | awk -F'\t' '{ if($7 >= 0.4) print $0 }' >> ${TEMP_OUTPUT_FILE}   # filter the current block (re-reads the file from the top every pass)
7 gzip ${TEMP_OUTPUT_FILE}   # compress this block's matches
8 cat ${TEMP_OUTPUT_FILE}.gz >> ${OUTPUT_FILE}.gz   # concatenated gzip members form a valid .gz
9 START_CTR=${END_CTR}
10 END_CTR=$(( START_CTR + BLOCK_SIZE ))
11 rm ${TEMP_OUTPUT_FILE}.gz
12 done
My questions:
- Is there a way to "pause" zcat | awk at intervals to perform the steps in lines 7-11 without making zcat start over again at the beginning of the file? For example, is it possible to embed lines 7-11 within an awk statement so that they run every time NR is a multiple of 100000? (The first sketch after this list shows the kind of thing I mean.)
- Besides the problem of creating huge temporary files, zcat is just pretty slow for files this size. However, each of these 23 huge data files has a companion info file with the same number of lines; instead of 12,000 columns, it has only a handful, one of which holds the variable I use to decide which lines to extract from the huge data file. A script can read through this info file very quickly and record the line numbers of the rows that need to be extracted. Is there some way to extract those lines without actually reading through the huge data file to find the line endings? (Or, at least, is there a way of reading through the file that's faster than zcat? The second sketch after this list shows the two-pass idea I have in mind.)
- Are there other clever ways of getting around the problems of speed and temporary file size?
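For the first question, here is a rough, untested sketch of what I'm imagining (it reuses the variable names from my script above; I don't know whether juggling close() and system() inside awk like this is a sensible idea):

```sh
zcat "${HUGE_DATA_FILE}" | awk -F'\t' -v tmp="${TEMP_OUTPUT_FILE}" -v out="${OUTPUT_FILE}" '
    BEGIN { gz = "gzip > " tmp ".gz" }   # shell command used as an output pipe
    $7 >= 0.4 { print $0 | gz }          # compress the matching rows of the current block
    NR % 100000 == 0 {                   # every 100,000 input rows: flush the block
        close(gz)                        # finish the current gzip member
        system("[ -f " tmp ".gz ] && cat " tmp ".gz >> " out ".gz && rm " tmp ".gz")
    }
    END {                                # flush whatever is left in the final partial block
        close(gz)
        system("[ -f " tmp ".gz ] && cat " tmp ".gz >> " out ".gz && rm " tmp ".gz")
    }'
```

If I understand awk's output pipes correctly, the temporary file would then only ever hold one block's worth of matches, already compressed.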
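For the second question, the two-pass idea would be something like this (${INFO_FILE}, its column 5, and the scratch file wanted_lines.txt are placeholders for the real info file, the column holding the variable, and a temporary list; as far as I know a plain gzip stream can't be seeked into, so the second pass still streams through the whole file once, but the filtering condition is only ever evaluated against the small info file):

```sh
# Pass 1: scan the small info file and record the line numbers that qualify.
zcat "${INFO_FILE}" | awk -F'\t' '$5 >= 0.4 { print NR }' > wanted_lines.txt

# Pass 2: stream through the huge file once, keep only the recorded lines,
# and compress straight into the final output (no large temporary file).
zcat "${HUGE_DATA_FILE}" \
    | awk 'NR == FNR { want[$1]; next } FNR in want' wanted_lines.txt - \
    | gzip >> "${OUTPUT_FILE}.gz"
```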