I am trying to grep a 1M-row '|'-separated file against 320K patterns stored in another file, piping the work through Ole Tange's parallel package and writing the matched results to a third file. I am running Cygwin on Windows 7 with 24 cores and 16GB of physical memory.
After going through Grepping a huge file (80GB) any way to speed it up?, this is the command I used:
< matchReport1.dat parallel --pipe --block 2M LC_ALL=C grep --file=nov15.DAT > test.match
where matchReport1.dat is the 1M-row '|'-separated file and the 320K patterns are stored in nov15.DAT. Task Manager shows activity on all 24 cores, physical memory usage jumps to ~15GB, and I start getting "grep: memory exhausted" messages.
I then tried splitting the nov15.DAT pattern file into 10 smaller chunks and running grep on those:
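One thing I considered (a sketch only, not verified on the full data): with --pipe, every one of the 24 workers loads the full 320K-pattern file at once, so capping the job count with -j should bound the total memory:

```shell
# Each grep worker loads all of nov15.DAT (~320K patterns), so 24
# workers mean 24 copies resident at once; -j4 caps it at 4
# concurrent greps at the cost of less parallelism.
< matchReport1.dat parallel -j4 --pipe --block 2M \
    "LC_ALL=C grep --file=nov15.DAT" > test.match
```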
parallel --bar -j0 -a xaa "LC_ALL=C grep {} matchReport1.dat" > testxaa
but this just takes too long (getting through only 1.6K of the ~30K patterns in xaa took about 15 minutes).
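In hindsight I suspect the -a xaa "grep {}" form is the problem: it hands parallel one pattern per job, so it spawns one grep process per line of xaa (~30K processes). Passing each chunk file whole via --file would mean only ten grep invocations in total (a sketch, assuming split's default xaa..xaj chunk names):

```shell
# One grep per chunk file (10 jobs total) instead of one grep per
# pattern; sort -u merges duplicates in case a line matches
# patterns from more than one chunk.
parallel -j10 "LC_ALL=C grep --file={} matchReport1.dat" ::: xa? \
    | sort -u > test.match
```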
My nov15.DAT pattern file consists of strings like 'A12345M', while matchReport1.dat contains strings like 'A12345M_dfdf' and 'A12345M_02', so I assumed I cannot use grep's -F option. Could someone suggest a fix, or any option other than resorting to a database?
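As a sanity check on that assumption, I tried -F on inline sample data (not the real files). It seems -F only makes grep treat each pattern as a literal string; it still matches anywhere inside the line, so 'A12345M' does match 'A12345M_dfdf':

```shell
# -F = fixed strings, but still substring matching: the literal
# pattern may occur anywhere in the line.
printf 'A12345M_dfdf|1\nA12345M_02|2\nB00000Z|3\n' \
    | LC_ALL=C grep -F 'A12345M'
# prints the first two lines only
```

If that holds on the real files, grep -F --file=nov15.DAT may be usable here after all, and is typically much faster than regex matching.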
Here's a sample of each file:
nov15.DAT -> http://pastebin.com/raw/cUeGcYLb
matchReport1.dat -> http://pastebin.com/raw/01KSGN6k