Use AWK to search through fasta file, given a second file containing sequence names

Question

I have a 2 files. One is a fasta file contain multiple fasta sequences, while another file includes the names of candidate sequences I want to search (file Example below).

seq.fasta

>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC

name.txt

Clone_23
Clone_27-1

I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.

awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta

The problem is that I can only extract the sequence of first candidate in name.txt, like this

>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA

Can anyone help to fix one-line awk command above?

You radically changed the question after answers have been posted. I've rolled that back. Please post a new question in that case. (And show what you've tried to adapt from the answers you got here) — hek2mgl, Jul 20 '16 at 22:58
I can only post 1 question in 90 mins. Can I post new example in Answer session? — user2993059, Jul 20 '16 at 23:06
Well, it actually recommends using comments or re-edit question (since I need to use format to show the example) — user2993059, Jul 20 '16 at 23:07
No that would be deleted by moderators. What about using the 90mins to think about a solution on your own? I guess they are meant for that. Actually it's just 30mins, since you asked this an hour ago. — hek2mgl, Jul 20 '16 at 23:08
I'd recommend using the time to come up with a truly representative example so we don't waste more time trying to help you solve a problem you don't have. — Ed Morton, Jul 20 '16 at 23:20

hek2mgl · Accepted Answer · 2016-07-21T19:58:03.360

2

If it is ok or even desired to print the name as well, you can simply use grep:

grep -Ff name.txt -A1 a.fasta

-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
A1 prints the matching line plus the subsequent line

If the names are not desired in output I would simply pipe to another grep:

above_command | grep -v '>'

An awk solution can look like this:

awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta

Better explained in a multiline version:

# True as long as we are reading the first file, name.txt
NR==FNR {
    # Store the names in the array 'n'
    n[$0]
    next
}

# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline

edited Jul 21 '16 at 19:58

answered Jul 20 '16 at 21:49

hek2mgl

152,036
28
249
266

1

Hmm, I wonder what that will do if/when the getline fails... ;-). If you MUST use getline (hint - you don't here) then at least protect yourself from failures with `if ( (getline line) > 0 ) print line`. See http://awk.freeshell.org/AllAboutGetline. – Ed Morton Jul 20 '16 at 22:22
Hey. I was thinking about that but I thought it is not an issue since if `getline` fails in this case, the fasta format is broken and it *indeed* does not contain the sequence. Isn't it? – hek2mgl Jul 20 '16 at 22:31
Ok, I see. In that case it would print the name which is not desired.. Let's assume the fasta file is correct! ;) – hek2mgl Jul 20 '16 at 22:34
getline can fail for more reasons than just reaching the end of the file, e.g. a simple one would be someone removes the input file in mid-processing. In a scenario like that you'd think the script would just stop processing and produce no more output but that's not what your script will do if you're using getline, it'll duplicate your previously read input. There's other weird stuff that makes it usually not the best choice. – Ed Morton Jul 20 '16 at 22:35
Wouldn't the kernel keep the file in memory as long as the file descriptor is open? When somebody overwrites the file it might be more problematic. But I'm not 100% certain about that. Need to refresh my knowledge there. But anyhow, your point is valid. Let me fix that. – hek2mgl Jul 20 '16 at 22:36
Idk, maybe that's a bad example, I was just trying to think of something simple that could cause a failure. I know it CAN fail, I'm just not sure of the scenarios. You also now might have to deal with a case where the FASTA file has 2 consecutive `>name` lines (is that a "broken" file? idk..) - as written if the first name matches a target then your script will skip over the second name even if it matches one of the targets. You also have to consider what happens if you have a 2nd file to process and need an FNR==1 test - the getline can cause awk to jump right over it. Too much to consider... – Ed Morton Jul 20 '16 at 22:37
1

@EdMorton Updated it. Thanks for your help! I like to put the `getline > 0` right into the condition. `awk` rocks! :) – hek2mgl Jul 20 '16 at 22:43
1

Sounds good. My main aversion to getline is that when I see/write a script using it then I have to mentally go through all of the checklists of side-effects to see if any of them can happen with negative consequences in the current application. Usually I end up deciding "it'll probably be OK" and about 50% of the time I'm right about that :-) but mainly I'd rather just not have to think about it so find it easier not to use it except in a couple of applications where it's clearly the right solution (e.g. a recursive descent parser). – Ed Morton Jul 20 '16 at 22:46
1

Good point. If the programmer needs to think about too much things he might forget some. A solution with as few as possible side-effects is definitely often the better way. Especially if somebody who didn't wrote the initial code has to modify it later. Didn't had a strong opinion about `getline` so far. I'm using it rarely. But I see your points. – hek2mgl Jul 20 '16 at 22:51
Looks like the OP just changed their requirements such that the getline solution would now be completely inappropriate - that's the other reason to avoid getline, the slightest requirements change and you're back at square 1. – Ed Morton Jul 20 '16 at 23:16
I see. Putting it in a loop, which would be my first intention here, would necessarily soak up the next `>NAME` line as well, meaning it doesn't really work in this form. – hek2mgl Jul 21 '16 at 07:43
1

Exactly. You'd need to duplicate the test for `>` that's already in the main body after the call to `getline`. It's evil, that's all I'm sayin' ;-). – Ed Morton Jul 21 '16 at 12:01

score 2 · Answer 2 · answered Jul 20 '16 at 22:27

2

$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC

answered Jul 20 '16 at 22:27

Ed Morton

188,023
17
78
185

If one sequence is broken into multiple lines, the command above can only read first line. That was why I changed default RS. See changed example above – user2993059 Jul 20 '16 at 22:53
I don't see any changes in your example. It'd be trivial to tweak this solution to work for multi-line records but I'm not inclined to do so as I can't imagine why you wouldn't have included that info in your question and in your example in the first place though - I'd hope it's obvious that to parse a file we need to know the format of the contents of the file! Do you have any other surprises in store for use like multiple consecutive `>` lines or.... Post a new question with accurate input/output and we'll take it from there. – Ed Morton Jul 20 '16 at 23:15

Use AWK to search through fasta file, given a second file containing sequence names

2 Answers2

Linked