
I am trying to grep a 1M-row '|'-separated file against 320K patterns stored in another file, using Ole Tange's parallel package and piping the matched results into an output file. I am using Cygwin on Windows 7 with 24 cores and 16 GB of physical memory.

The command I used, after going through this link Grepping a huge file (80GB) any way to speed it up?, is:

< matchReport1.dat parallel --pipe --block 2M LC_ALL=C grep --file=nov15.DAT > test.match

where matchReport1.dat is the 1M-row '|'-separated file and the 320K patterns are stored in nov15.DAT. Task Manager shows activity on all 24 cores, physical memory usage jumps to ~15 GB, and I start getting messages that grep's memory has been exhausted.
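The memory blow-up follows from how --pipe works: every grep process compiles the whole 320K-line pattern file into its own matcher, and parallel spawns one grep per input block, so many copies of that pattern state live in memory at once. A minimal sketch with toy stand-in files (the .toy names and their contents are made up for illustration):

```shell
# toy stand-ins for nov15.DAT and matchReport1.dat (contents invented)
printf 'A12345M\nB67890N\n' > patterns.toy
printf '"r1"|"A12345M_02"|"x"\n"r2"|"C00000Z_01"|"y"\n' > data.toy

# each grep loads the entire pattern file; under `parallel --pipe` on 24
# cores, up to 24 such processes hold that state simultaneously
LC_ALL=C grep --file=patterns.toy data.toy
# -> "r1"|"A12345M_02"|"x"
```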

I then tried to split the nov15.DAT patterns file into 10 smaller chunks and run grep on each of those:

parallel --bar -j0 -a xaa "LC_ALL=C grep {} matchReport1.dat" > testxaa

but this just takes too long (grepping only 1.6K out of 30K lines took about 15 minutes).

My nov15.DAT pattern file consists of strings like 'A12345M', and the file the patterns need to match against, i.e. matchReport1.dat, has strings like 'A12345M_dfdf' and 'A12345M_02', so I cannot use the -F option in grep. Could someone suggest a fix, or any option other than using databases?

Here's a sample:

nov15.DAT -> http://pastebin.com/raw/cUeGcYLb

matchReport1.dat -> http://pastebin.com/raw/01KSGN6k

andrnev
  • Add to your question some lines of matchReport1.dat and some lines of nov15.DAT. – Cyrus Dec 29 '15 at 18:58
    This may be helpful: [EXAMPLE: Grepping n lines for m regular expressions](http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions) – Cyrus Dec 29 '15 at 20:53

2 Answers


I assume that you only want to compare strings from nov15.DAT with the start of the second column of matchReport1.dat.

Try this: modify nov15.DAT so that grep does not have to compare every row from the first to the last character:

sed 's/.*/^"[^|]*"|"&/' nov15.DAT > mov15_mod1.DAT

And then use mov15_mod1.DAT with your parallel command.
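To illustrate what the rewrite does (the .toy file name and sample IDs below are made up): each bare ID becomes a regex anchored at the start of the line, so grep can abandon a row as soon as the prefix up to the second column fails to match. Since grep's default basic regex syntax treats '|' as a literal character, the rewritten patterns work with a plain grep -f.

```shell
# made-up sample of the pattern file
printf 'A12345M\nB67890N\n' > nov15.toy

# '&' in the replacement is the whole matched line (the original ID)
sed 's/.*/^"[^|]*"|"&/' nov15.toy
# -> ^"[^|]*"|"A12345M
#    ^"[^|]*"|"B67890N
```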

Cyrus
  • I understood the mod you proposed. I will get back with performance stats as soon as I can, based on the gnu parallel link you posted above. – andrnev Dec 31 '15 at 09:31

Not very accurate, but if the IDs in nov15 are unique and do not match anywhere else on the line, then this might just work. And it is fast:

perl -F'\|' -ane 'BEGIN{chomp(@nov15=`cat nov15.DAT`);@m{@nov15}=1..$#nov15+1;} for $l (split/"|_/,$F[1]) { if($m{$l}) { print }}' matchReport1.dat 
Ole Tange