0

I have a file in the form (Input_fasta.txt)

>tr|A0A089QH62|A0A089QH62_MYCTU Histidine kinase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_00865 PE=4 SV=1
MTATASGIAATAPNCGEASINDVPIAESERRYLGARSASEYGQEIPLW
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
>tr|A0A089SBT4|A0A089SBT4_MYCTU Glycosyl transferase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_19775 PE=4 SV=1
MDTETHYSDVWVVIPAFNEAAVIGKVVTDVRSVFDHVVCVDDGSTDGTGDIARRSGAHLV
RHPINLGQGAAIQTGIEYARKQPGAQVFATFDGDGQHRVKDVAAMVDRLGAGDVDVVIGT
RFGRPVGKASASRPPLMKRIVLQTGARLSRRGRRLGLTDTNNGLRVFNKTVADGLNITMS
GMSHATEFIMLIAENHWRVAEEPVEVLYTEYSKSKGQPLLNGVNIIFDGFLRGRMPR
>tr|A0A089QKT1|A0A089QKT1_MYCTU TetR family transcriptional regulator OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_00800 PE=4 SV=1
MSLTAGRGPGRPPAAKADETRKRILHAARQVFSERGYDGATFQEIAVRADLTRPAINHYF
ANKRVLYQEVVEQTHELVIVAGIERARREPTLMGRLAVVVDFAMEADAQYPASTAFLATT
VLESQRHPELSRTENDAVRATREFLVWAVNDAIERGELAADVDVSSLAETLLVVLCGVGF
YIGFVGSYQRMATITDSFQQLLAGTLWRPPT
>tr|I6YAB3|I6YAB3_MYCTU Iron ABC transporter permease OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_07380 PE=4 SV=1
MARGLQGVMLRSFGARDHTATVIETISIAPHFVRVRMVSPTLFQDAEAEPAAWLRFWFPD
PNGSNTEFQRAYTISEADPAAGRFAVDVVLHDPAGPASSWARTVKPGATIAVMSLMGSSR
FDVPEEQPAGYLLIGDSASIPGMNGIIETVPNDVPIEMYLEQHDDNDTLIPLAKHPRLRV
RWVMRRDEKSLAEAIENRDWSDWYAWATPEAAALKCVRVRLRDEFGFPKSEIHAQAYWNA
GRAMGTHRATEPAATEPEVGAAPQPESAVPAPARGSWRAQAASRLLAPLKLPLVLSGVLA
ALVTLAQLAPFVLLVELSRLLVSGAGAHRLFTVGFAAVGLLGTGALLAAALTLWLHVIDA
RFARALRLRLLSKLSRLPLGWFTSRGSGSIKKLVTDDTLALHYLVTHAVPDAVAAVVAPV
GVLVYLFVVDWRVALVLFGPVLVYLTITSSLTIQSGPRIVQAQRWAEKMNGEAGSYLEGQ
PVIRVFGAASSSFRRRLDEYIGFLVAWQRPLAGKKTLMDLATRPATFLWLIAATGTLLVA
THRMDPVNLLPFMFLGTTFGARLLGIAYGLGGLRTGLLAARHLQVTLDETELAVREHPRE
PLDGEAPATVVFDHVTFGYRPGVPVIQDVSLTLRPGTVTALVGPSGSGKSTLATLLARFH
DVERGAIRVGGQDIRSLAADELYTRVGFVLQEAQLVHGTAAENIALAVPDAPAEQVQVAA
REAQIHDRVLRLPDGYDTVLGANSGLSGGERQRLTIARAILGDTPVLILDEATAFADPES
EYLVQQALNRLTRDRTVLVIAHRLHTITRADQIVVLDHGRIVERGTHEELLAAGGRYCRL
WDTGQGSRVAVAAAQDGTR
>tr|L0T545|L0T545_MYCTU PPE family protein PPE7 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=PPE7 PE=4 SV=1
MSVCVIYIPFKGCVKHVSVTIPITTEHLGPYEIDASTINPDQPIDTAFTQTLDFAGSGTV
GAFPFGFGWQQSPGFFNSTTTPSSGFFNSGAGGASGFLNDAAAAVSGLGNVFTETSGFFN
AGGVGIRASKTSATCCRAGRT

and another file containing the pattern like(Pattern.txt)

I6WXB4

I6WXC3

I6WXK8

I need an output like

>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH

what I have done till now is grep -f Pattern.txt Input_fasta.txt

How to extend the output to next lines till I hit next ">" after the match ?

tried awk '/I6WXB4/{copy=1;next} />/{copy=0;next} copy' Input_fasta.txt which gave an output

MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH

but header is missing here.
oguz ismail
  • 1
  • 16
  • 47
  • 69
Haroon
  • 17
  • 1
  • 6

3 Answers3

2

In awk:

$ awk 'NR==FNR{a[$0]; next} $2 in a' pattern.txt FS="|" RS=">" input_fasta.tzt
tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
James Brown
  • 36,089
  • 7
  • 43
  • 59
  • 1
    wrt quotes, here's a good rule of thumb to avoid surprises: Quote using `'`s unless you NEED to use `"`s (e.g. to expand a shell variable) and quote using `"`s until you NEED to have no quotes (e.g. for globbing). You'd be amazed how many times I've been bit by habitually using `"` instead of `'` :-(. – Ed Morton Sep 13 '16 at 14:44
0

Here’s a simple Python solution, using BioPython:

import sys
import re
from Bio import SeqIO

with open('pattern.txt', 'r') as f:
    patterns = '|'.join([re.escape(pattern.strip()) for pattern in f])

for record in SeqIO.parse('test.fa', 'fasta'):
    if re.search(patterns, record.id):
        SeqIO.write(record, sys.stdout, 'fasta')

Note that this requires a well-behaved patterns.txt file, i.e. one that doesn’t contain any empty lines.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
0

bash & sed solution:

while read pattern
do

if [ ! -z $pattern ] ; then

 sed -n "/\|$pattern\|/{:loop;p;n;/>/q;bloop;}" input.txt
fi

done < patternfile.txt

loops on pattern file (blank lines are skipped) and if finds the pattern, just read & print the file lines through the end or until it finds >

Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
  • Read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) and [is-it-possible-to-escape-regex-metacharacters-reliably-with-sed](http://stackoverflow.com/questions/29613304/is-it-possible-to-escape-regex-metacharacters-reliably-with-sed) to understand **some** of the reasons you must never do that. – Ed Morton Sep 13 '16 at 13:39
  • @EdMorton what do you propose to avoid reading with `while` ? (there are a few patterns in the file). And what's wrong with this example since `pattern` is a group of letters, so no risk of regex conflict? I agree with you in the general case (I would have created a python script to get the best performance & safety) but in that simple case it works fine and quickly. – Jean-François Fabre Sep 13 '16 at 13:44
  • Absolute best case scenario with highly sanitized input (which this is not - how do you know someone won't add descriptive text like `tr|foo|foo_MYCTU 30S ribosomal protein (behaves like |I6WXB4|)` in input.txt?), it will be an order of magnitude slower than an awk script, lengthier to write, harder to understand and I'm guessing non-portable. Asking "why not" for this is like saying "I know I have a chainsaw sitting next to me but I can cut this tree down with my pocket knife, what's wrong with that?". – Ed Morton Sep 13 '16 at 13:56
  • FWIW if you WERE going to do this with sed then the correct shell code would be something like `xargs -n 1 -I {} < pattern.txt sed 'whatever' input.txt` but it's still the wrong approach. – Ed Morton Sep 13 '16 at 14:03