
I would like to intersect a series of small files with a relatively large file. Following the many topics on Stack Overflow, and after some tests, I chose the loop below, which was the fastest on my data:

for file1 in ./myfiles*
do
    # Do other things to create file1 and file2
    # Then use file1 as a list of patterns to filter file2 and keep its first column
    grep -f "$file1" file2.txt | awk -F '\t' '{print $1}' > "myResults_$(basename "$file1").txt"
done

where file1 is a single-column file of 50 to 100,000 lines and file2 is a two-column, tab-delimited file of ~1 million lines.

Ex:

file1

A
B
C

file2

A 1
B 2
C 3
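
Running the loop on this example, grep keeps all three lines of file2 and awk then prints only the first column, so the result file contains:

A
B
C

For comparison, the same exact-match intersection can be done in a single awk pass that only holds file1 in memory (a minimal sketch, assuming the whole lines of file1 should match column 1 of file2.txt exactly; note that grep -f treats the lines of file1 as substring/regex patterns instead):

awk -F '\t' 'NR==FNR { keys[$0]; next } $1 in keys { print $1 }' "$file1" file2.txt > "myResults_$(basename "$file1").txt"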

I run the command on a cluster with 1 thread and 48 GB of RAM. However, as soon as it reaches a file1 bigger than 10,000 lines, it crashes with the following error:

slurmstepd: Job 3312063 exceeded memory limit (50359784 > 50331648), being killed

Can someone explain to me why this command stores so much in memory, and how can I solve this issue?
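
For reference, one way to pin down which step is actually responsible for the memory use is GNU time's verbose mode (a minimal sketch, assuming /usr/bin/time is available on the cluster node):

/usr/bin/time -v grep -f "$file1" file2.txt > /dev/null

The "Maximum resident set size" line in its output gives the peak memory of the grep step alone, which can be compared against the SLURM limit.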

Radek
  • Make sure you have no blank lines in `file1` - they match everything. Check like this https://stackoverflow.com/a/13506134/2836621 – Mark Setchell Sep 25 '18 at 08:40
  • Are you sure that the grep/awk are running when the memory limit is hit? Or could it be some other code that is not shown? – Poshi Sep 25 '18 at 09:58
  • There's nothing obvious in that script that would cause a memory issue. grep reads each file1 into memory 1 at a time but 100000 single-column lines should be no big deal. Other than that every line is handled 1 at a time. It'll be slow but that's the tradeoff for NOT using a lot of memory. – Ed Morton Sep 25 '18 at 14:35
  • Thank you all. At least I have confirmation that it should in theory not come from here. I'll keep investigating. – Radek Sep 25 '18 at 20:15
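
A minimal version of the blank-line check Mark Setchell suggests above, run over all the pattern files at once (it counts empty and whitespace-only lines, printing one filename:count pair per file; an empty line in a -f pattern file matches every line of file2, so any count above 0 is worth cleaning up first):

grep -c '^[[:space:]]*$' ./myfiles*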

0 Answers