0

Is there a way to grep/awk whole multi-line clusters based on a pattern?

I have a file as follows:

File.txt

>Cluster1
1 rabbit eats carrot 
2 Lion is the king of jungle
3 Dogs loves toys 
4 Cats loves mice 
>Cluster2
1 Horse loves grass 
2 Giraffes love leaves
3 Hippos love water
>Cluster3
1 Snakes love trees 
2 Sharks love fish
3 Tigers love bushes 
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running 
2 Dogs love toys
3 Cats love food
>Cluster5
1 rabbit eats carrot 
2 Leopards love running
3 Cats love food

If the pattern is "Dogs", I would like the output to contain every cluster that has a matching line:

>Cluster1
1 rabbit eats carrot 
2 Lion is the king of jungle
3 Dogs loves toys 
4 Cats loves mice
>Cluster3
1 Snakes love trees 
2 Sharks love fish
3 Tigers love bushes 
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running 
2 Dogs love toys
3 Cats love food

Is this possible?

cms72

3 Answers

3
perl -0777ne'print grep /\bDogs\b/, split /^(?=>)/m, $_' file

or

perl -ne'
   sub p { print $buf if $buf =~ /\bDogs\b/; }
   if (/^>/) { p(); $buf = ""; }
   $buf .= $_;
   END { p() }
' file

Notes:

  • The first version loads the entire file into memory; the second holds only one record at a time.
  • Both versions search the first line of the record as well as the subsequent lines.
  • You can place the second program all on one line if you want.
  • These address two bugs in karakfa's answer:
    • These don't add a leading blank line if the first record doesn't match.
    • These don't remove the final line feed if the last record doesn't match.
  • Related: Specifying file to process to Perl one-liner.
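
Where Perl is not available, the buffering approach of the second program can be sketched in POSIX awk. This is a minimal sketch and not part of the original answer: the function name `print_matching_records` is hypothetical, and the pattern is used as a plain awk regex (POSIX awk has no `\b`, so `Dogs` would also match `Dogsled`).

```shell
# Hypothetical helper: print every ">"-delimited record that contains PATTERN.
# Usage: print_matching_records PATTERN FILE
print_matching_records() {
    awk -v pat="$1" '
        # Print the buffered record only if it matches the pattern.
        function emit() { if (buf ~ pat) printf "%s", buf }
        /^>/ { emit(); buf = "" }     # a header line ends the previous record
             { buf = buf $0 "\n" }    # accumulate the current line (header included)
        END  { emit() }               # check the last record as well
    ' "$2"
}
```

Calling `print_matching_records Dogs File.txt` searches the header line as well as the lines that follow it. Unlike the Perl versions, awk appends a newline to the last line if the input lacks one.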
ikegami
  • Thank you ikegami! Worked like a charm! And thank you for the explanation - it's really useful. – cms72 Oct 20 '20 at 02:36
  • I didn't really explain them. /// The first reads the whole file as a single line (`-0777`), splits it on the zero-width string that precedes a `>` at the start of a line, filters out the blocks that don't match the word `Dogs`, then prints those that remain. /// The second accumulates lines until it sees `>` at the start of a line. Then, it prints the accumulated lines if they contain the word `Dogs`. Either way, the accumulated lines are cleared. The check is performed once more at the end of the file. – ikegami Oct 20 '20 at 02:39
  • Wow - Thanks ikegami! – cms72 Oct 20 '20 at 10:03
2

This needs an awk with multi-character (regex) RS support and the RT variable, i.e. gawk:

$ awk -v RS='(^|\n)>Cluster' -v ORS='' '/Dogs/{print rt $0} {rt=RT}' file

>Cluster1
1 rabbit eats carrot
2 Lion is the king of jungle
3 Dogs loves toys
4 Cats loves mice
>Cluster3
1 Snakes love trees
2 Sharks love fish
3 Tigers love bushes
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running
2 Dogs love toys
3 Cats love food
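
For awks without `RT`, a portable variant can be sketched with a single-character record separator. This is a sketch under a stated assumption, not the answer's method: it assumes `>` occurs only at the start of a record, because it splits on every `>` in the input. The function name `records_split_on_gt` is hypothetical.

```shell
# Hypothetical helper: split the input on ">" and print the records that match PATTERN.
# Assumes ">" never appears inside a record.
# Usage: records_split_on_gt PATTERN FILE
records_split_on_gt() {
    awk -v RS='>' -v ORS='' -v pat="$1" '
        # NR == 1 is whatever precedes the first ">" (normally empty), so skip it.
        # RS consumed the ">", so put it back in front of each printed record.
        NR > 1 && $0 ~ pat { print ">" $0 }
    ' "$2"
}
```

Each record keeps its own newlines, so this prints no leading blank line when the first record does not match and leaves the final newline in place when the last record does.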
karakfa
  • @ikegami It won't capture the first one, where there is no preceding new line. – karakfa Oct 19 '20 at 23:41
  • Yeah, the default value for RS is a line feed, so saying the first line must start with the value of RS makes no sense. – ikegami Oct 19 '20 at 23:45
  • @ikegami it won't be captured as a record separator but part of the first record instead. Easy to test, for example not printing `rt` will show. – karakfa Oct 19 '20 at 23:45
  • @ikegami it's not just style, but idiomatic use. – karakfa Oct 19 '20 at 23:48
  • There's a small bug: Your solution displays a leading blank line if the first record doesn't match. (`/\bDogs\b/` is probably better than `/Dogs/`.) (This is probably a FASTA file, in which case `>` is better than `>Cluster`.) Deleted my other comments. – ikegami Oct 19 '20 at 23:50
  • Correct, with this approach you get either a leading blank line or, with a slight change, a trailing one. The pattern is matched as a regex; for a whole-word match you can add the boundaries. That was not clear in the question. – karakfa Oct 19 '20 at 23:54
  • Thanks karakfa - I should have made it clear in the question that it is a FASTA file. But I really appreciate your answer using awk - which I could use for other things! – cms72 Oct 20 '20 at 02:40
0

Demo code that takes the search pattern as an argument and reads the file line by line (low memory requirements).

Replace <DATA> with a filehandle to process a real input file.

use strict;
use warnings;

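my $regex = shift || 'Dogs';   # search pattern from the first argument; defaults to 'Dogs'
my $found = 0;                 # true once the current block contains a match
my $block = '';                # lines of the current block, header included

while( <DATA> ) {
    next if /^\s*\z/;          # skip blank lines
    if( /^>Cluster\d+/ ) {     # a header line starts a new block
        print $block if $found;    # emit the previous block if it matched
        $found = 0;
        $block = $_;
    } else {
        $found = 1 if /\b$regex\b/;
        $block .= $_;
    }
}

print $block if $found;        # emit the last block if it matched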

__DATA__
>Cluster1
1 rabbit eats carrot 
2 Lion is the king of jungle
3 Dogs loves toys 
4 Cats loves mice 
>Cluster2
1 Horse loves grass 
2 Giraffes love leaves
3 Hippos love water
>Cluster3
1 Snakes love trees 
2 Sharks love fish
3 Tigers love bushes
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running 
2 Dogs love toys
3 Cats love food
>Cluster5
1 rabbit eats carrot 
2 Leopards love running
3 Cats love food

Output

>Cluster1
1 rabbit eats carrot
2 Lion is the king of jungle
3 Dogs loves toys
4 Cats loves mice
>Cluster3
1 Snakes love trees
2 Sharks love fish
3 Tigers love bushes
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running
2 Dogs love toys
3 Cats love food
Polar Bear
  • That never prints the last record, even if it matches. (Also, its calling convention means you can't replace `<DATA>` with `<>`, so it's not actually usable.) – ikegami Oct 20 '20 at 02:33
  • @ikegami -- fixed the printing of the last found record. Why are you talking about `<>` when I referred to `<$fh>`? You need to check your head? – Polar Bear Oct 20 '20 at 06:02
  • I tried the code and it worked! But I have to give the point to ikegami for this one, as it's an easy one-liner for the command line. But thank you so much for answering! – cms72 Oct 20 '20 at 09:36
  • And it did print the last record :) – cms72 Oct 20 '20 at 09:37
  • @cms72 Re "*And it did print the last record :)*", Try again. It will never print the last record (`Cluster5`), even if you add `Dogs` to it. – ikegami Oct 20 '20 at 10:17
  • Ah I see - I was thinking of the last record as the last cluster with Dogs in it (i.e. Cluster 4). But yes, you're right - if I add Dogs to Cluster 5, it won't print Cluster 5. Sorry Polar Bear - but thank you anyway! – cms72 Oct 20 '20 at 21:15