grep (bash) multi-line pattern

Question

In bash (4.3.46(1)) I have some multi-line, so-called FASTA records, where each record starts with a line of the form >name and the following lines contain DNA sequence ([AGCTNacgtn]). Here are three records:

>chr1
AGCTACTTTT
AGGGNGGTNN
>chr2
TTGNACACCC
TGGGGGAGTA
>chr3
TGACGTGGGT
TCGGGTTTTT

How do I use grep in bash to get the second record? In other languages one might use:

>chr2\n([AGCTNagctn]*\n)*

In bash I tried to use the ideas from here (among other SO posts). This did not work:

grep -zo '>chr2[AGCTNacgtn]+' file

Result should be:

>chr2
TTGNACACCC
TGGGGGAGTA

SOLUTION

On my system this was the solution (almost Cyrus' answer below, i.e. without the pipe to a second grep .):

grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file

Cyrus · Accepted Answer · 2017-04-13T17:46:43.857

3

With GNU grep:

grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file | grep .

Output:

>chr2
TTGNACACCC
TGGGGGAGTA
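The "Binary file (standard input) matches" message reported in the comments below most likely comes from the NUL byte that `grep -z` appends to each match it prints. A minimal sketch, assuming GNU grep built with `-P` (PCRE) support, that strips the NUL with `tr` instead of piping to a second grep; the temporary file and the variable names are only illustrative:

```shell
# Sample data from the question, written to a temporary file (illustrative path).
fasta=$(mktemp)
printf '%s\n' '>chr1' AGCTACTTTT AGGGNGGTNN \
              '>chr2' TTGNACACCC TGGGGGAGTA \
              '>chr3' TGACGTGGGT TCGGGTTTTT > "$fasta"

# -P: PCRE, so \n means a newline; -z: lines end at NUL, so the whole file is
# one record; -o: print only the matched text.
# tr removes the NUL terminator that -z adds after the match.
record=$(grep -Pzo '>chr2\n[AGCTNacgtn\n]+' "$fasta" | tr -d '\0')
printf '%s\n' "$record"

rm -f "$fasta"
```

Note that the pattern `>chr2` would also match a header such as `>chr20`; the sample data has no such header.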

edited Apr 13 '17 at 17:46

answered Apr 13 '17 at 17:35


Oh, I thought -z made `\n` not necessary ? – user3375672 Apr 13 '17 at 17:37
Only `grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file` works; with the final `grep .` my system says "Binary file (standard input) matches", whatever that means. – user3375672 Apr 13 '17 at 17:48

I've got '>chr2(\n[^>\n]+)+' --> no trailing new line – silel Apr 13 '17 at 17:49
@user3375672: Okay. I rolled my answer back to previous version. – Cyrus Apr 13 '17 at 17:52

@silel: It might depend on grep's version. I used version 2.6.3. – Cyrus Apr 13 '17 at 17:53
@Cyrus I certainly hope it does not depend on `grep`'s version (the regex grammar is the same) modulo any bug. My regex is actually different. With it you don't need to pipe to another `grep` process. – silel Apr 13 '17 at 17:59
@silel: Thank you. I'd overlooked that. – Cyrus Apr 13 '17 at 18:05

score 2 · Answer 2 · answered Apr 13 '17 at 17:35


You can use awk with custom RS:

awk -v n=2 -v RS='(^|\n)>' 'NR==n+1{print ">" $0}' file    
>chr2
TTGNACACCC
TGGGGGAGTA
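A regular expression as `RS` is an extension (gawk and mawk support it) rather than POSIX awk behaviour. As a hedged alternative, not the answer's own method, the sketch below keeps the default line-based records and toggles a flag on each header line, which selects the record by name instead of by position. The variable names `name` and `keep` and the temporary file are assumptions made for illustration:

```shell
# Sample data from the question, written to a temporary file (illustrative path).
fasta=$(mktemp)
printf '%s\n' '>chr1' AGCTACTTTT AGGGNGGTNN \
              '>chr2' TTGNACACCC TGGGGGAGTA \
              '>chr3' TGACGTGGGT TCGGGTTTTT > "$fasta"

# On every header line, set keep to 1 if the header is exactly ">" name, else 0.
# The bare pattern "keep" then prints each line while the flag is set.
record=$(awk -v name=chr2 '/^>/ { keep = ($0 == (">" name)) } keep' "$fasta")
printf '%s\n' "$record"

rm -f "$fasta"
```

Because the header is compared as a whole line, `chr2` does not also select `chr20`.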

answered Apr 13 '17 at 17:35

anubhava


clt60 · Answer 3 · 2017-04-13T17:56:30.400

1

You should install the FAST Perl package. It contains many utilities for dealing with FASTA files that are directly usable from the shell, such as fashead and fastail (and many more).

After installing it, this is as simple as:

fashead -n2 fastafile | fastail -n1

output

>chr2
TTGNA.....

or even simpler

fasgrep chr2 fastafile

with the same output...

edited Apr 13 '17 at 17:56

answered Apr 13 '17 at 17:48


score 0 · Answer 4 · answered Apr 13 '17 at 18:06


Try this -

grep 'chr2' -A 2 file
>chr2
TTGNACACCC
TGGGGGAGTA
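This approach relies on every record having exactly two sequence lines, and the unanchored pattern `chr2` would also match headers such as `>chr20`. A small sketch of the same idea with the header anchored to the whole line; `-A` is supported by GNU and BSD grep but is not part of POSIX, and the extra `>chr20` record in the sample is an assumption added to show the difference:

```shell
# Sample with a second header that merely starts with "chr2" (illustrative data).
fasta=$(mktemp)
printf '%s\n' '>chr2' TTGNACACCC TGGGGGAGTA \
              '>chr20' TGACGTGGGT TCGGGTTTTT > "$fasta"

# ^ and $ anchor the match to the complete header line, so >chr20 is skipped.
# -A 2 prints the matching line plus the two lines after it.
record=$(grep -A 2 '^>chr2$' "$fasta")
printf '%s\n' "$record"

rm -f "$fasta"
```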

answered Apr 13 '17 at 18:06

VIPIN KUMAR


score 0 · Answer 5 · answered Apr 09 '19 at 18:50

The best tool for working with multi-line records is awk.

In your case:

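awk 'BEGIN{RS=">"} NR==3 {printf "%s%s", RS, $0}' input.txt

input.txt

>chr1
AGCTACTTTT
AGGGNGGTNN
>chr2
TTGNACACCC
TGGGGGAGTA
>chr3
TGACGTGGGT
TCGGGTTTTT

Explanation:

BEGIN{RS=">"} Initially set the record separator to ">"

NR==3 Select record #3 only. Because the file starts with ">", record #1 is the empty text before the first separator, so chr1 is record #2 and chr2 is record #3.

{printf "%s%s", RS, $0} Print the record with the stripped record separator put back in front. The record already ends with a newline, so printf is used to avoid the extra blank line that print would add.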
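Selecting by record number depends on how awk counts the empty text before the first ">". A hedged variant that sidesteps the counting by matching on the header name: with `FS` set to a newline, `$1` is the header line of each record. The variable `name` and the temporary file are illustrative, and the sketch assumes headers consist of the name only, with no trailing description:

```shell
# Sample data from the question, written to a temporary file (illustrative path).
fasta=$(mktemp)
printf '%s\n' '>chr1' AGCTACTTTT AGGGNGGTNN \
              '>chr2' TTGNACACCC TGGGGGAGTA \
              '>chr3' TGACGTGGGT TCGGGTTTTT > "$fasta"

# RS=">" splits the input into one record per FASTA entry;
# FS="\n" makes the first field ($1) the header text without the ">".
# printf restores the ">" and does not append a newline of its own.
record=$(awk -v name=chr2 'BEGIN { RS = ">"; FS = "\n" }
                           $1 == name { printf ">%s", $0 }' "$fasta")
printf '%s\n' "$record"

rm -f "$fasta"
```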

score 0 · Answer 6 · answered Oct 20 '20 at 15:07

I created a mixed sed/grep utility to handle this in a generic way. The sedgrep shell command is available at https://github.com/iamdvr/sedgrep-shell-util

Direct Link: https://github.com/iamdvr/sedgrep-shell-util/blob/main/sedgrep

For your case, the direct command is:

cat <FILE_NAME> | sed -nr ':main; /^>.*chr2/ { :loop; p; n; /^>/ b main; b loop} '

sedgrep usage is as follows:

Default NEW_LINE_PATTERN is ^\[
Usage : 
    cat {INPUT_FILE_NAME}  | sedgrep  {NEW_LINE_PATTERN} {THREAD_OR_SEARCH_PATTERN} 
    cat {INPUT_FILE_NAME}  | sedgrep  {THREAD_OR_SEARCH_PATTERN} 
    sedgrep {NEW_LINE_PATTERN} {THREAD_OR_SEARCH_PATTERN} {INPUT_FILE_NAME}
    sedgrep {THREAD_OR_SEARCH_PATTERN} {INPUT_FILE_NAME}
Example : 
    cat sampleInput.log | sedgrep 2016-05-23 DB_CONN
    cat sampleInput.log | sedgrep DB_CONN
    sedgrep 2016-05-23 DB_CONN sampleInput.log
    sedgrep DB_CONN sampleInput.log
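For the FASTA case specifically, a sed address range can express the same selection without explicit labels. This is only a sketch of an alternative, not part of sedgrep: the range runs from the exact `>chr2` header to the next header line, and the block prints the starting header plus every non-header line inside the range. The temporary file is illustrative:

```shell
# Sample data from the question, written to a temporary file (illustrative path).
fasta=$(mktemp)
printf '%s\n' '>chr1' AGCTACTTTT AGGGNGGTNN \
              '>chr2' TTGNACACCC TGGGGGAGTA \
              '>chr3' TGACGTGGGT TCGGGTTTTT > "$fasta"

# /^>chr2$/,/^>/  range from the chr2 header up to and including the next header
# /^>chr2$/p      print the chr2 header itself
# /^>/!p          print sequence lines, skipping the header that closes the range
record=$(sed -n '/^>chr2$/,/^>/{/^>chr2$/p;/^>/!p;}' "$fasta")
printf '%s\n' "$record"

rm -f "$fasta"
```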
