I would like to intersect a series of small files with a relatively large file. Following the many topics on Stack Overflow, and after some tests, I chose this command, which was the fastest on my data:
for file1 in ./myfiles*
do
    # Do other things to create file1 and file2
    # Then keep the first column of every line of file2.txt that matches a pattern from file1
    grep -f "$file1" file2.txt | awk -F '\t' '{print $1}' > "myResults_$(basename "$file1").txt"
done
where file1 is a single-column file of 50 to 100,000 lines and file2.txt is a two-column, tab-delimited file of ~1 million lines.
Example:
file1:
A
B
C
file2.txt:
A 1
B 2
C 3
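For this example, the output file I expect is simply the first column of every matching line of file2.txt:
A
B
C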
I run the command on a cluster with 1 thread and 48 GB of RAM. However, as soon as it reaches a file1 bigger than 10,000 lines, it crashes with the following error:
slurmstepd: Job 3312063 exceeded memory limit (50359784 > 50331648), being killed
Can someone explain why this command uses so much memory, and how I can solve this issue?
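For reference, a lower-memory alternative I am considering (untested, and it assumes I only need exact matches on the first column of file2.txt, not the regex/substring matching that grep -f does anywhere on the line) would be to build the lookup table in awk instead of passing all of file1 as patterns to grep:

# Untested sketch: load file1 into an awk hash, then stream file2.txt through it.
# Assumes exact matches on the first column only (unlike grep -f).
awk -F '\t' 'NR==FNR {keys[$0]; next} $1 in keys {print $1}' "$file1" file2.txt > "myResults_$(basename "$file1").txt"

Memory for this should scale with the size of file1 only, since file2.txt is read line by line. Would this be a reasonable replacement, or is there a better way to keep grep's behavior without the memory blow-up?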