How do I grep/awk multiple lines from a cluster based on a pattern?

Question

Is there a way I can grep/awk multiple lines from a cluster based on a pattern?

I have a file as follows:

File.txt

>Cluster1
1 rabbit eats carrot 
2 Lion is the king of jungle
3 Dogs loves toys 
4 Cats loves mice 
>Cluster2
1 Horse loves grass 
2 Giraffes love leaves
3 Hippos love water
>Cluster3
1 Snakes love trees 
2 Sharks love fish
3 Tigers love bushes 
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running 
2 Dogs love toys
3 Cats love food
>Cluster5
1 rabbit eats carrot 
2 Leopards love running
3 Cats love food

If the pattern is "Dogs", I would like the output to be:

>Cluster1
1 rabbit eats carrot 
2 Lion is the king of jungle
3 Dogs loves toys 
4 Cats loves mice
>Cluster3
1 Snakes love trees 
2 Sharks love fish
3 Tigers love bushes 
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running 
2 Dogs love toys
3 Cats love food

Is this possible?

ikegami · Accepted Answer · 2020-10-20T02:29:52.680

3

perl -0777ne'print grep /\bDogs\b/, split /^(?=>)/m, $_' file

or

perl -ne'
   sub p { print $buf if $buf =~ /\bDogs\b/; }
   if (/^>/) { p(); $buf = ""; }
   $buf .= $_;
   END { p() }
' file

Notes:

- The first version loads the entire file into memory (but not the second).
- Both versions search the first line of the record as well as the subsequent lines.
- You can place the second program all on one line if you want.
- These address two bugs in karakfa's answer:
  - These don't add a leading blank line if the first record doesn't match.
  - These don't remove the final line feed if the last record doesn't match.
- Related: Specifying file to process to Perl one-liner.
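
For readers who would rather use Python, the streaming approach of the second program could be sketched roughly as follows. This is not part of the original answer: the function name `matching_blocks` and its parameters are hypothetical, it assumes every record starts with a line beginning with `>`, and it treats the search term as a literal word rather than a regex.

```python
import re
from typing import Iterable, Iterator


def matching_blocks(lines: Iterable[str], word: str = "Dogs") -> Iterator[str]:
    """Yield each '>'-delimited block that contains `word` as a whole word."""
    # re.escape treats `word` literally (an assumption; the Perl code uses a regex).
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    buf = ""
    for line in lines:
        # A line starting with '>' begins a new record: flush the previous one.
        if line.startswith(">"):
            if pattern.search(buf):
                yield buf
            buf = ""
        buf += line
    # Flush the last record, mirroring the END block in the Perl version.
    if pattern.search(buf):
        yield buf
```

Because it only ever holds one record in memory, passing an open file handle should work for large inputs, e.g. `sys.stdout.writelines(matching_blocks(open("File.txt")))`.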

edited Oct 20 '20 at 02:29

answered Oct 19 '20 at 23:55

Thank you ikegami! Worked like a charm! And thank you for the explanation - it's really useful. – cms72 Oct 20 '20 at 02:36
I didn't really explain them. /// The first reads the whole file as a single line (`-0777`), splits it on the zero-width string that precedes a `>` at the start of a line, filters out the blocks that don't match the word `Dogs`, then prints those that remain. /// The second accumulates lines until it sees `>` at the start of a line. Then, it prints the accumulated lines if they contain the word `Dogs`. Either way, the accumulated lines are cleared. The check is performed once more at the end of the file. – ikegami Oct 20 '20 at 02:39
Wow - Thanks ikegami! – cms72 Oct 20 '20 at 10:03
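
The slurp-and-split idea described in the comment above can also be sketched in Python. This is only an illustration under stated assumptions: `matching_blocks_slurp` is a hypothetical name, the search term is taken literally, and splitting on a zero-width match needs Python 3.7 or later.

```python
import re
from typing import List


def matching_blocks_slurp(text: str, word: str = "Dogs") -> List[str]:
    """Split `text` before each line-initial '>' and keep blocks containing `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    # (?m) lets ^ match at every line start; (?=>) is a zero-width lookahead,
    # so the '>' stays attached to the block that follows it.
    blocks = re.split(r"(?m)^(?=>)", text)
    # Unlike Perl's split, Python keeps an empty leading field; it never
    # matches the pattern, so the filter drops it.
    return [block for block in blocks if pattern.search(block)]
```

As with the `-0777` one-liner, the whole input is held in memory, so this form suits files that fit comfortably in RAM.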

score 2 · Answer 2 · answered Oct 19 '20 at 23:38

With multi-character RS support (i.e. gawk):

$ awk -v RS='(^|\n)>Cluster' -v ORS='' '/Dogs/{print rt $0} {rt=RT}' file

>Cluster1
1 rabbit eats carrot
2 Lion is the king of jungle
3 Dogs loves toys
4 Cats loves mice
>Cluster3
1 Snakes love trees
2 Sharks love fish
3 Tigers love bushes
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running
2 Dogs love toys
3 Cats love food

karakfa

@ikegami It won't capture the first one, where there is no preceding new line. – karakfa Oct 19 '20 at 23:41
Yeah, the default value for RS is a line feed, so saying the first line must start with the value of RS makes no sense. – ikegami Oct 19 '20 at 23:45
@ikegami it won't be captured as a record separator but part of the first record instead. Easy to test, for example not printing `rt` will show. – karakfa Oct 19 '20 at 23:45
@ikegami it's not just style, but idiomatic use. – karakfa Oct 19 '20 at 23:48
There's a small bug: Your solution displays a leading blank line if the first record doesn't match. (`/\bDogs\b/` is probably better than `/Dogs/`.) (This is probably a FASTA file, in which case `>` is better than `>Cluster`.) Deleted my other comments. – ikegami Oct 19 '20 at 23:50
Correct, with this approach it's either a leading blank line or, with a slight change, a trailing blank line. The pattern is matched as a regex; for a whole-word match you can add the boundaries. That wasn't clear in the question. – karakfa Oct 19 '20 at 23:54
Thanks karakfa - I should have made it clear in the question that it is a FASTA file. But I really appreciate your answer using awk - which I could use for other things! – cms72 Oct 20 '20 at 02:40
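
The difference between `/Dogs/` and `/\bDogs\b/` raised in these comments can be seen with a small Python check. The sample line is made up for illustration and does not come from the question's file.

```python
import re

# Hypothetical line in which "Dogs" appears only as part of a longer word.
line = "2 Dogsled races are fun"

plain = re.search(r"Dogs", line) is not None          # substring match
whole_word = re.search(r"\bDogs\b", line) is not None  # whole-word match

print(plain, whole_word)  # prints: True False
```

The plain pattern would pull in a cluster because of `Dogsled`, while the word-boundary version would not.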

Polar Bear · Answer 3 · 2020-10-20T17:10:14.710

0

Demo code that takes the search pattern as an argument and reads the file line by line (low memory requirements).

Replace <DATA> with a filehandle to process a real input file.

use strict;
use warnings;

my $regex = shift || 'Dogs';
my $found = 0;
my $block;

while( <DATA> ) {
    next if /^\s*\z/;      # skip blank lines
    if( />Cluster\d+/ ) {
        print $block if $found;
        $found = 0;
        $block = $_;
    } else {
        $found = 1 if /\b$regex\b/;
        $block .= $_;
    }
}

print $block if $found;

__DATA__
>Cluster1
1 rabbit eats carrot 
2 Lion is the king of jungle
3 Dogs loves toys 
4 Cats loves mice 
>Cluster2
1 Horse loves grass 
2 Giraffes love leaves
3 Hippos love water
>Cluster3
1 Snakes love trees 
2 Sharks love fish
3 Tigers love bushes
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running 
2 Dogs love toys
3 Cats love food
>Cluster5
1 rabbit eats carrot 
2 Leopards love running
3 Cats love food

Output

>Cluster1
1 rabbit eats carrot
2 Lion is the king of jungle
3 Dogs loves toys
4 Cats loves mice
>Cluster3
1 Snakes love trees
2 Sharks love fish
3 Tigers love bushes
4 Cats love toys
5 Dogs love food
>Cluster4
1 Leopards love running
2 Dogs love toys
3 Cats love food

edited Oct 20 '20 at 17:10

answered Oct 20 '20 at 00:57

That never prints the last record, even if it matches. (Also, its calling convention means you can't replace `<DATA>` with `<>`, so it's not actually usable.) – ikegami Oct 20 '20 at 02:33
@ikegami -- fixed the printing of the last found record. Why are you talking about `<>` when I referred to `<$fh>`? You need to check your head? – Polar Bear Oct 20 '20 at 06:02
I tried the code and it worked! But I have to give the point to ikegami for this one, as it's an easy one-liner for the command line. But thank you so much for answering! – cms72 Oct 20 '20 at 09:36
And it did print the last record :) – cms72 Oct 20 '20 at 09:37
@cms72 Re "*And it did print the last record :)*", Try again. It will never print the last record (`Cluster5`), even if you add `Dogs` to it. – ikegami Oct 20 '20 at 10:17
Ah I see - I was thinking of the last record as the last cluster with Dogs in it (i.e. Cluster 4). But yes, you're right - if I add Dogs to Cluster 5, it won't print Cluster 5. Sorry Polar Bear - but thank you anyway! – cms72 Oct 20 '20 at 21:15
