
I have 23 gzipped files of genetic data, with between 3.8 and 24 million rows each. Each file has over 12,000 columns. I need to extract the rows where a variable in a particular column is above a certain value.

It's easy to do this by piping the file (let's call it ${HUGE_DATA_FILE}) from zcat to awk and printing the lines that meet the condition to a temporary file that is gzipped at the end. However, maybe 40% of the lines meet the condition, and the temporary file becomes immense. If I try this with multiple files in parallel, the non-gzipped temporary files rapidly take up all the available memory.
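
For reference, a minimal sketch of that one-pass approach (the column number and threshold are taken from the block script below; the variable names are placeholders):

zcat "${HUGE_DATA_FILE}" | awk -F'\t' '$7 >= 0.4' > "${TEMP_OUTPUT_FILE}"   # the uncompressed temp file is what grows huge
gzip "${TEMP_OUTPUT_FILE}"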

I wrote a script that processes the file in blocks: it reads through 100,000 rows of data, extracting the appropriate lines to a temporary file, then gzips the temporary file and appends it to an output file. The output is correct, but getting there is slow. Every time the loop starts on a new block (line 6), zcat begins reading ${HUGE_DATA_FILE} from the beginning again, which seems like a real waste of time.

1   BLOCK_SIZE=100000
2   START_CTR=1
3   END_CTR=$(( START_CTR + BLOCK_SIZE ))
4   while [ $START_CTR -lt $MAX_LINE ]
5   do
6       zcat ${HUGE_DATA_FILE} | tail -n +${START_CTR} | head -n ${BLOCK_SIZE} | awk -F'\t' '{ if($7 >= 0.4) print $0 }' >> ${TEMP_OUTPUT_FILE}
7       gzip ${TEMP_OUTPUT_FILE}
8       cat ${TEMP_OUTPUT_FILE}.gz >> ${OUTPUT_FILE}.gz
9       START_CTR=${END_CTR}
10      END_CTR=$(( START_CTR + BLOCK_SIZE ))
11      rm ${TEMP_OUTPUT_FILE}.gz
12  done

My questions:

  1. Is there a way to "pause" zcat | awk at intervals to perform the steps in lines 7-11 without making zcat start over again at the beginning of the file? For example, is it possible to embed lines 7-11 within an awk statement so that they get run every time NR is a multiple of 100000?
  2. Besides the problem of making huge temporary files, zcat is just pretty slow for files this size. However, for each of these 23 huge data files there's an info file with the same number of lines; instead of 12,000 columns, it has just a handful of columns, one of which holds the variable I'm using to decide which lines to extract from the huge data file. A script can read through this info file very rapidly and record the line numbers of the lines that need to be extracted from the huge data file. Is there some way to extract those lines without actually reading through the huge data file to find the line endings? (Or, at least, is there a way of reading through the file that's faster than zcat?) A rough sketch of this idea appears after the list.
  3. Are there other clever ways of getting around the problems of speed and temporary file size?
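
A rough sketch of the idea behind question 2, assuming the info file is also gzipped and tab-separated with the filter variable in (say) its 4th column (the file names, the wanted_lines.txt name, and the column number are placeholders):

zcat "${INFO_FILE}" | awk -F'\t' '$4 >= 0.4 { print NR }' > wanted_lines.txt
zcat "${HUGE_DATA_FILE}" | awk 'NR==FNR { keep[$1]; next } FNR in keep' wanted_lines.txt - | gzip -c > "${OUTPUT_FILE}.gz"

This still decompresses the whole data file once (a plain gzip stream cannot be seeked into by line number), but it makes only a single pass and never writes an uncompressed temporary file.
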
Brian
    Why are you using a loop and temp files like that instead of a single pipeline like `zcat ${HUGE_DATA_FILE} | awk 'whatever' | gzip -c > ${OUTPUT_FILE}.gz`? – Ed Morton Apr 03 '22 at 02:46
  • In addition to a straightforward pipeline, consider using zstd instead of gzip. Faster and better compression, and lots of options that can be tweaked to improve compression of huge files. – Shawn Apr 03 '22 at 02:54
  • @Ed Morton: does that single pipeline gzip each line as it's printed by awk and append it to the (gzipped) output file? I thought that a single pipeline would generate a huge temporary file and then gzip it all at once, after zcat and awk were completed. I'm not an experienced bash programmer, so there's a lot I don't know about the internal mechanics of these commands. Thanks! – Brian Apr 03 '22 at 03:01
  • idk the internals of gzip either but it wouldn't create a huge temp file and I'd be shocked if it read all of the input into memory before producing output given it's typically used on huge files so that'd be a major design issue. I expect it'll simply do what you want. – Ed Morton Apr 03 '22 at 03:11

1 Answer


Just do this instead of a loop and temp files:

zcat "$HUGE_DATA_FILE" | awk 'whatever' | gzip -c > "${OUTPUT_FILE}.gz"

By the way, please read Correct Bash and shell script variable capitalization and https://mywiki.wooledge.org/Quotes and copy/paste all of your shell scripts into http://shellcheck.net while learning.

Ed Morton
    Thank you! This worked. I apologize for the long delay in replying-- I ended up implementing your suggested method in a few scripts and making sure that they did what they were expected to do before returning here. The links you've provided above are helpful-- I've copied them over to my notebook for easy reference. – Brian Apr 14 '22 at 20:17