How to do multiple match and print different number of lines after each pattern using awk

Question

I have a big file with thousands of lines that looks like this:

>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00002235.4
TTACGCAT
TAGGCCAG
>ENST00005546.9
TTTATCGC
TTAGGGTAT

I want to grep specific ids (the text after the > sign), for example ENST00001234.1, and get all the lines after the match up to the next >, regardless of how many lines there are. I want to grep about 63 ids this way at once.

If I grep the ids ENST00001234.1 and ENST00005546.9, the ideal output would be:

>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00005546.9
TTTATCGC
TTAGGGTAT

I tried awk '/ENST00001234.1/ENST00005546.9/{print}' but it did not help.

You might be interested in an application called bioawk, developed on top of awk for these purposes. — Daemon Painter, Sep 10 '20 at 09:40

Sundeep · Accepted Answer · 2020-09-10T09:59:52.250

You can set > as the record separator:

$ awk -F'\n' -v RS='>' -v ORS= '$1=="ENST00001234.1"{print RS $0}' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC

-F'\n' to make it easier to compare the search term with first line
-v RS='>' set > as input record separator
-v ORS= clear the output record separator, otherwise you'll get extra newline in the output
$1=="ENST00001234.1" this will do string comparison and matches the entire first line, otherwise you'll have to escape regex metacharacters like . and add anchors
print RS $0 if match is found, print > and the record content
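
To see what awk actually compares against, it can help to print the first field of every record once > is the record separator. This is only a minimal sketch: the temporary directory, the file name sample.fa, and its contents are made up for illustration.

```shell
# Hypothetical sample file in the same layout as the question's data
tmp=$(mktemp -d)
printf '>ENST00001234.1\nACGTACGTACGG\nTTACCCAGTACG\n>ENST00002235.4\nTTACGCAT\n' > "$tmp/sample.fa"

# With RS='>' each record is one header plus its sequence lines;
# with -F'\n' the first field is the header line (the id)
records=$(awk -F'\n' -v RS='>' '{print NR ": id=[" $1 "]"}' "$tmp/sample.fa")
printf '%s\n' "$records"
```

Because the file starts with >, the first record is empty, so its id prints as []. That empty record never equals a real id, which is why the command above needs no special case for it.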

If you want to match more than one search term, put them in a file:

$ cat f1
ENST00001234.1
ENST00005546.9

$ awk 'BEGIN{FS="\n"; ORS=""}
       NR==FNR{a[$0]; next}
       $1 in a{print RS $0}' f1 RS='>' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00005546.9
TTTATCGC
TTAGGGTAT

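Here, the contents of f1 are used to build the keys of array a. f1 is read with the default record separator (newline), so each id becomes one key. The RS='>' assignment sits between the two file names, so it only takes effect once the first file has been read and changes the record separator for the second file.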

$1 in a will check if the first line matches a key in array a
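
One edge case worth knowing about: a blank line in the id file would create an empty key, and that key matches the empty record that precedes the first > in the data file, so a stray > gets printed. The sketch below is a possible guard, not part of the original answer. The file contents are hypothetical, and the only change to the awk program is the `$0 != ""` check.

```shell
tmp=$(mktemp -d)
# Hypothetical data file and id list; note the blank line inside the id list
printf '>ENST00001234.1\nACGT\nTTAC\n>ENST00002235.4\nTTACGCAT\n>ENST00005546.9\nTTTATCGC\n' > "$tmp/ip.txt"
printf 'ENST00001234.1\n\nENST00005546.9\n' > "$tmp/f1"

out=$(awk 'BEGIN{FS="\n"; ORS=""}
           NR==FNR{if ($0 != "") a[$0]; next}   # store ids, skipping blank lines
           $1 in a{print RS $0}' "$tmp/f1" RS='>' "$tmp/ip.txt")
printf '%s\n' "$out"
```

With the guard in place, only the two requested records are printed, each starting with its > header.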

RavinderSingh13 · Answer 2 · 2020-09-10T10:01:25.630

EDIT (generic solution): To look for multiple strings in Input_file, list them all in the awk variable search, separated by commas. Every matching header is then printed together with the lines that follow it.

awk -v search="ENST00001234.1,ENST00002235.4" '
BEGIN{
  num=split(search,arr,",")
  for(i=1;i<=num;i++){
    look[">"arr[i]]
  }
}
/^>/{
  if($0 in look){ found=1  }
  else          { found="" }
}
found
' Input_file

To read the ids to search for from another file instead, try the following, where look_file holds all the ids to look up and Input_file is the actual content file.

awk '
FNR==NR{
  look[">"$0]
  next
}
/^>/{
  if($0 in look){ found=1  }
  else          { found="" }
}
found
' look_file  Input_file

For a single-string search: try the following, written and tested with the shown samples in GNU awk. Put the string to search for in the variable search.

awk -v search="ENST00001234.1" '
/^>/{
  if($0==">"search){  found=1  }
  else             {  found="" }
}
found
' Input_file

Explanation: a detailed, line-by-line explanation of the command above.

awk -v search="ENST00001234.1" '     ##Start the awk program and set the search variable to the id to look for.
/^>/{                                ##If the line starts with >, do the following.
  if($0==">"search){  found=1  }     ##If the current line equals > followed by the value of search, set found to 1.
  else             {  found="" }     ##Otherwise reset found to the empty string.
}
found                                ##If found is set, print the current line.
' Input_file                         ##Name of the input file.
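
The string comparison matters because an id contains a dot, which matches any character when the id is used as a regular expression. The sketch below contrasts the two. The input file and the second id (ENST00001234x1) are invented purely to show the difference, and the flag logic is condensed into a single assignment.

```shell
tmp=$(mktemp -d)
# Hypothetical data: the second header differs from the first only where the "." is
printf '>ENST00001234.1\nAAAA\n>ENST00001234x1\nCCCC\n' > "$tmp/in.fa"

# Regex match: the unescaped "." matches any character, so both headers are selected
regex_out=$(awk '/^>/{found = ($0 ~ /^>ENST00001234.1$/)} found' "$tmp/in.fa")

# String comparison: only the header that is exactly ">" followed by the id is selected
string_out=$(awk -v search="ENST00001234.1" '/^>/{found = ($0 == (">" search))} found' "$tmp/in.fa")

printf 'regex:\n%s\n\nstring:\n%s\n' "$regex_out" "$string_out"
```

Here the regex version prints both records, while the string version prints only the ENST00001234.1 record.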

Timur Shtatland · Answer 3 · 2020-09-10T13:37:10.887

There is no need to reinvent the wheel. There are several bioinformatics tools for this task (extract fasta sequences using a list of sequence ids). For example, seqtk subseq:

Extract sequences with names in file name.lst, one sequence name per line:

seqtk subseq in.fq name.lst > out.fq

It works with fasta files as well. Use conda install seqtk or conda create --name seqtk seqtk to install the seqtk package, which has other useful functionalities, and is very fast.

SEE ALSO:

Retrieve FASTA sequences using sequence ids
Extract fasta sequences from a file using a list in another file
How To Extract A Sequence From A Big (6Gb) Multifasta File?
extract sequences from multifasta file by ID in file using awk
