Let me start off by saying I don't want to print only the duplicate lines nor do I want to remove them.

I am trying to use grep with a pattern file to parse a large data file.

The pattern file, for example, may look like this:

1243
1234
1234
1234
1354
1356
1356
1677

etc. with more single and duplicate entries.

The Input data file might look like this:

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
ttttt   1555    bbbbbb
ppppp   1354    pppppp
yyyyy   3333    zzzzzz
qqqqq   1677    eeeeee
iiiii   4444    iiiiii

etc. for 27000 lines.

When I use

grep -f 'Patternfile.txt' 'Inputfile.txt' > 'Outputfile.txt'

I get an output file that resembles this:

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
ppppp   1354    pppppp
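
This collapsing is expected: grep emits each matching input line at most once, no matter how many (possibly duplicate) patterns match it. A minimal sketch that shows the behaviour (file names here are made up):

```shell
# grep -f prints each matching input line once, even when the
# pattern file contains the same pattern several times.
printf '1234\n1234\n1234\n' > pat.txt
printf 'yyyyy 1234 vvvvvv\n' > in.txt
grep -f pat.txt in.txt    # the line is printed once, not three times
```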

How can I get it to also report the duplicates, so I end up with something like this?

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
ppppp   1354    pppppp


qqqqq   1677    eeeeee

Additionally, I would like a blank line printed whenever a query in the pattern file does not match anything in the input file.

Thank you!

2 Answers

One solution, not with grep, but with perl:

With patternfile.txt and inputfile.txt containing the data from your original post, the following script.pl should do the job. (I assume the string to match is always the second column; otherwise, it should be modified to use a regexp instead. This way is faster.)

use warnings;
use strict;

## Check arguments.
die qq[Usage: perl $0 <pattern-file> <input-file>\n] unless @ARGV == 2;

## Open input files.
open my $pattern_fh, qq[<], shift @ARGV or die qq[Cannot open pattern file\n];
open my $input_fh, qq[<], shift @ARGV or die qq[Cannot open input file\n];

## Hashes to save the patterns and the input lines.
my (%pattern, %input);

## Read each pattern and record its first position and how many times it appears.
while ( <$pattern_fh> ) { 
    chomp;
    if ( exists $pattern{ $_ } ) { 
        $pattern{ $_ }->[1]++;
    }   
    else {
        $pattern{ $_ } = [ $., 1 ];
    }   
}

## Read file with data and save them in another hash.
while ( <$input_fh> ) { 
    chomp;
    my @f = split;
    $input{ $f[1] } = $_; 
}

## For each pattern (in order of first appearance), look it up in the data.
## If found, print the line as many times as the pattern appeared;
## otherwise print that many blank lines.
for my $p ( sort { $pattern{ $a }->[0] <=> $pattern{ $b }->[0] } keys %pattern ) { 
    if ( $input{ $p } ) { 
        printf qq[%s\n], $input{ $p } for ( 1 .. $pattern{ $p }->[1] );
    }   
    else {
         # Old behaviour.
         # printf qq[\n];

         # New requirement.
         printf qq[\n] for ( 1 .. $pattern{ $p }->[1] );
    }   
}

Run it like:

perl script.pl patternfile.txt inputfile.txt

It gives the following output:

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
ppppp   1354    pppppp


qqqqq   1677    eeeeee
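
The same idea also fits in a single awk command, shown here as a hedged sketch (same assumption: the key is always column 2). The first pass counts the patterns in order of first appearance, the second pass indexes the input by its second column, and the END block replays:

```shell
# First file: count each pattern, remembering first-appearance order.
# Second file: index each line by its second column.
# END: replay each pattern cnt times, printing the line or a blank.
awk 'NR==FNR { if (!($1 in cnt)) order[++n] = $1; cnt[$1]++; next }
     { line[$2] = $0 }
     END { for (i = 1; i <= n; i++) {
               p = order[i]
               for (j = 1; j <= cnt[p]; j++)
                   print ((p in line) ? line[p] : "")
           } }' patternfile.txt inputfile.txt
```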
Birei
  • Thanks, this is pretty close.. I'm also trying to have it print a blank line if the pattern isn't found. But this has been a ton of help so far! – PlutonicFriend Mar 26 '12 at 20:42
  • @PlutonicFriend: How should the blank lines work? Add it to your question to receive help. I will try too. – Birei Mar 26 '12 at 20:53
  • hmm, funny style with `qq[]` :) – gaussblurinc Mar 26 '12 at 20:59
  • @Birei Can I ask which variable indexes the values that don't exist? I'm still debugging, and if a pattern that isn't found is a duplicate, only one blank line is printed instead of duplicate blank lines. – PlutonicFriend Mar 27 '12 at 00:39
  • @PlutonicFriend: You changed again your input and output of the original question. I've already updated the answer. The last `else` does that job. I've updated the code and commented the old behaviour. Next time try to give accurate requirements from the beginning, because it's not nice to be modifying the script for those small modifications each time. – Birei Mar 27 '12 at 08:15
  • sorry. It's what I meant the first time and then I realized I was misunderstood. my apologies. – PlutonicFriend Mar 27 '12 at 14:19
  • @PlutonicFriend: It's not important :-) But take into account that more users will be willing to help if they see clear requirements from the beginning. – Birei Mar 27 '12 at 14:39

You aren't so much grepping for the patterns as you are left-joining the data in Inputfile.txt to the data in Patternfile.txt.

You can (mostly) accomplish this with join, a handy Unix utility I've come to know pretty well since I've been trying to solve a problem similar to yours.

There are a couple small differences, though.

First the command:

join -a 1 -2 2 <(sort Patternfile.txt) <(sort -k2,2 Inputfile.txt)

And explanation:

  • -a 1 means to also include unjoinable lines from file 1 (Patternfile.txt). I added this because you wanted to include "blank" lines for unmatchable rows, and this was the closest I could get.
  • -2 2 means to join on field 2 for file 2. (You can set the field for both files with -1 FIELD and -2 FIELD; the default is field 1.) This is because the key you are joining on in Inputfile.txt is in the second column.
  • <(sort Patternfile.txt) — the files must be sorted on the join field for join to work correctly.
  • <(sort -k2,2 Inputfile.txt) — sort input file from key 2 to key 2, inclusive

Output:

1234 yyyyy vvvvvv
1234 yyyyy vvvvvv
1234 yyyyy vvvvvv
1243 aatta qqqqqq
1354 ppppp pppppp
1356
1356
1677 qqqqq eeeeee

Differences

Slight differences between your specified output and this result:

  • It's sorted by the key order.
  • Unjoinable rows still contain their original key. If that's a problem, you can clear the unmatched rows by piping through a simple awk:

    ... | awk '{ if ($2 != "") print; else print ""  }'
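
Putting both steps together (assuming bash, since the command uses process substitution), the whole pipeline is:

```shell
# Left-join the pattern file against the input file (keyed on input
# column 2), then blank out the rows that had no match.
join -a 1 -2 2 <(sort Patternfile.txt) <(sort -k2,2 Inputfile.txt) \
    | awk '{ if ($2 != "") print; else print "" }'
```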
    
Nicole