0

I have a "source.fasta" file that contains information in the following format:

>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG
>TRINITY_DN83_c0_g1_i1 len=371 path=[1:0-173 152:174-370] [-1, 1, 152, -2]
GTTGTAAACTTGTATACAATTGGGTTCATTAAAGTGTGCACATTATTTCATAGTTGATTTGATTATTCCGAGTGACCTATTTCGTCACTCGATGTTTAAAGAAATTGCTAGTGTGACCCCAATTGCGTCAGACCAAAGATTGAATCTAGACATTAATTTCCTTTTGTATTTGTATCGAGTAAGTTTACAGTCGTAAATAAAGAATCTGCCTTGAACAAACCTTATTCCTTTTTATTCTAAAAGAGGCCTTTGCGTAGTAGTAACATAGTACAAATTGGTTTATTTAACGATTTATAAACGATATCCTTCCTACAGTCGGGTGAAAAGAAAGTATTCGAAATTAGAATGGTTCCTCATATTACACGTTGCTG
>TRINITY_DN83_c0_g1_i2 len=218 path=[1:0-173 741:174-217] [-1, 1, 741, -2]
GTTGTAAACTTGTATACAATTGGGTTCATTAAAGTGTGCACATTATTTCATAGTTGATTTGATTATTCCGAGTGACCTATTTCGTCACTCGATGTTTAAAGAAATTGCTAGTGTGACCCCAATTGCGTCAGACCAAAGATTGAATCTAGACATTAATTTCCTTTTGTATTTGTACCGAGTAAGTTTCCAGTCGTAAATAAAGAATCTGCCAGATCGGA
>TRINITY_DN99_c0_g1_i1 len=326 path=[1:0-242 221:243-243 652:244-267 246:268-325] [-1, 1, 221, 652, 246, -2]
ATCGGTACTATCATGTCATATATCTAGAAATAATACCTACGAATGTTATAAGAATTTCATAACATGATATAACGATCATACATCATGGCCTTTCGAAGAAAATGGCGCATTTACGTTTAATAATTCCGCGAAAGTCAAGGCAAATACAGACCTAATGCGAAATTGAAAAGAAAATCCGAATATCAGAAACAGAACCCAGAACCAATATGCTCAGCAGTTGCTTTGTAGCCAATAAACTCAACTAGAAATTGCTTATCTTTTATGTAACGCCATAAAACGTAATACCGATAACAGACTAAGCACACATATGTAAATTACCTGCTAAA
>TRINITY_DN90_c0_g1_i1 len=1240 path=[1970:0-527 753:528-1239] [-1, 1970, 753, -2]
GTCGATACTAGACAACGAATAATTGTGTCTATTTTTAAAAATAATTCCTTTTGTAAGCAGATTTTTTTTTTCATGCATGTTTCGAGTAAATTGGATTACGCATTCCACGTAACATCGTAAATGTAACCACATTGTTGTAACATACGGTATTTTTTCTGACAACGGACTCGATTGTAAGCAACTTTGTAACATTATAATCCTATGAGTATGACATTCTTAATAATAGCAACAGGGATAAAAATAAAACTACATTGTTTCATTCAACTCGTAAGTGTTTATTTAAAATTATTATTAAACACTATTGTAATAAAGTTTATATTCCTTTGTCAGTGGTAGACACATAAACAGTTTTCGAGTTCACTGTCG
>TRINITY_DN84_c0_g1_i1 len=301 path=[1:0-220 358:221-300] [-1, 1, 358, -2]
ACTATTATGTAGTACCTACATTAGAAACAACTGACCCAAGACAGGAGAAGTCATTGGATGATTTTCCCCATTAAAAAAAGACAACCTTTTAAGTAAGCATACTCCAAATTAAGGTTTAATTAGCTAAGTGAGCGCGAAAAATGATCAAATATACCGACGTCCATTTGGGGCCTATCCTTTTTAGTGTTCCTAATTGAAATCCTCACGTATACAGCTAGTCACTTTTAAATCTTATAAACATGTGATCCGTCTGCTCATTTGGACGTTACTGCCCAAAGTTGGTACATGTTTCGTACTCACG
>TRINITY_DN84_c0_g1_i2 len=301 path=[1:0-220 199:221-300] [-1, 1, 199, -2]
ACTATTATGTAGTACCTACATTAGAAACAACTGACCCAAGACAGGAGAAGTCATTGGATGATTTTCCCCATTAAAAAAAGACAACCTTTTAAGTAAGCATACTCCAAATTAAGGTTTAATTAGCTAAGTGAGCGCGAAAAATGATCAAATATACCGACGTCCATTTGGGGCCTATCCTTTTTAGTGTTCCTAATTGAAATCCTCACGTATACAGCTAGTCAGCTAACCAAAGATAAGTGTCTTGGCTTGGTATCTACAGATCTCTTTTCGTAATTTCGTGAGTACGAAACATGTACCAACT
>TRINITY_DN72_c0_g1_i1 len=434 path=[412:0-247 847:248-271 661:272-433] [-1, 412, 847, 661, -2]
GTTAATTTAGTGGGAAGTATGTGTTAAAATTAGTAAATTAGGTGTTGGTGTGTTTTTAATATGAATCCGGAAGTGTTTTGTTAGGTTACAAGGGTACGGAATTGTAATAATAGAAATCGGTATCCTTGAGACCAATGTTATCGCATTCGATGCAAGAATAGATTGGGAAATAGTCCGGTTATCAATTACTTAAAGATTTCTATCTTGAAAACTATTTCTAATTGGTAAAAAAACTTATTTAGAATCACCCATAGTTGGAAGTTTAAGATTTGAGACATCTTAAATTTTTGGTAGGTAATTTTAAGATTCTATCGTAGTTAGTACCTTTCGTTCTTCTTATTTTATTTGTAAAATATATTACATTTAGTACGAGTATTGTATTTCCAATATTCAGTCTAATTAGAATTGCAAAATTACTGAACACTCAATCATAA
>TRINITY_DN75_c0_g1_i1 len=478 path=[456:0-477] [-1, 456, -2]
CGAGCACATCAGGCCAGGGTTCCCCAAGTGCTCGAGTTTCGTAACCAAACAACCATCTTCTGGTCCGACCACCAGTCACATGATCAGCTGTGGCGCTCAGTATACGAGCACAGATTGCAACAGCCACCAAATGAGAGAGGAAAGTCATCCACATTGCCATGAAATCTGCGAAAGAGCGTAAATTGCGAGTAGCATGACCGCAGGTACGGCGCAGTAGCTGGAGTTGGCAGCGGCTAGGGGTGCCAGGAGGAGTGCTCCAAGGGTCCATCGTGCTCCACATGCCTCCCCGCCGCTGAACGCGCTCAGAGCCTTGCTCATCTTGCTACGCTCGCTCCGTTCAGTCATCTTCGTGTCTCATCGTCGCAGCGCGTAGTATTTACG

There are close to 400,000 sequences in this file.

I have another file ids.txt in the following format:

>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN8506_c0_g1_i1
>TRINITY_DN12276_c0_g2_i1
>TRINITY_DN15434_c5_g3_i1
>TRINITY_DN9323_c8_g3_i5
>TRINITY_DN11957_c1_g7_i1
>TRINITY_DN15373_c1_g1_i1
>TRINITY_DN22913_c0_g1_i1
>TRINITY_DN13029_c4_g5_i1

I have 100 sequence ids in this file. When I match these ids to the source file I want an output that gives me the match for each of these ids with the entire sequence.

For example, for an id:

>TRINITY_DN80_c0_g1_i1

I want my output to be:

>TRINITY_DN80_c0_g1_i1
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG

I want all hundred sequences in this format. I used this code:

while read p; do
echo ''$p >> out.fasta
grep -A 400000 -w $p source.fasta | sed -n -e '1,/>/ {/>/ !{'p''}} >> out.fasta
done < ids.txt

But my output is different in that only the last id has a sequence and the rest dont have any sequence associated:

>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN8506_c0_g1_i1
>TRINITY_DN12276_c0_g2_i1
....
>TRINITY_DN10309_c6_g3_i1
>TRINITY_DN6990_c0_g1_i1
TTTTTTTTTTTTTGTGGAAAAACATTGATTTTATTGAATTGTAAACTTAAAATTAGATTGGCTGCACATCTTAGATTTTGTTGAAAGCAGCAATATCAACAGACTGGACGAAGTCTTCGAATTCCTGGATTTTTTCAGTCAAGAGATCAACAGACACTTTGTCGTCTTCAATGACACACATGATCTGCAGTTTGTTGATACCATATCCAACAGGTACAAGTTTGGAAGCTCCCCAGAGGAGACCATCCATTTCGATGGTGCGGACCTGGTTTTCCATTTCTTTCATGTCTGTTTCATCATCCCATGGCTTGACGTCAAGGATTATAGATGATTTAGCAATGAGAGCAGGTTTCTTCGATTTTTTGTCAGCATAAGCTTTCAGACGTTCTTCACGAATTCTGGCGGCCTCTGCATCCTCTTCCTCGTCGCCAGATCCGAATAGGTCGACGTCATCATCGTCGTCATCCTTAGCAGCGGGTGCAGGTGCTGTGGTGGTCTTTCCGCCAGCGGTCAGAGGGCTAGCTCCAGCCGCCCAGGATTTGCGCTCCTCGGCATTGTAGGAGGCAATCTGGTTGTACCACCGGAGAGCGTGGGGCAAGCTTGCGCTCGGGGCCTTGCCGACTTGTTGGAACACTTGGAAATCGGCTTGAGTTGGTGTGTAACCTGACACATAACTCTTATCAGCTAAGAAATTGTTAAGCTCATTAAGGCCTTGTGCGGTTTTAACGTCTCCTACTGCCATTTTTATTTAAAAAAGTAGTTTTTTTCGAGTAATAGCCACACGCCCCGGCACAATGTGAGCAAGAAGGAATGAAAAAGAAATCTGACATTGACATTGCCATGAAATTGACTTTCAAAGAACGAATGAATTGAACTAATTTGAACGG

I am only getting the desired output for the 100th id from my ids.txt. Could someone help me on where my script is wrong. I would like to get all 100 sequences when i run the script. Thank you

I have added google drive links to the files i am working with: ids.txt

Source.fasta

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
S.Chereddy
  • 29
  • 5
  • That's horribly inefficient, as well as buggy. You want to process the big file just once. – tripleee Feb 18 '19 at 04:54
  • could you elaborate a bit please, what do you think would be the right code here considering the source.fasta file size is 250 megabytes. – S.Chereddy Feb 18 '19 at 05:03
  • 1
    None of the IDs from the ids.txt file is contained in the source.fasta file. – Cyrus Feb 18 '19 at 05:12
  • The quoting in your script is wrong, you want double quotes around your variables generally. So `echo "$p"` and `sed -n -e '1,/^>/{;/>/!p;}'` which doesn't contain any variable at all. This is probably relatively harmless for your inputs, but you want to read up on proper quoting; see e.g. https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable/27701642 – tripleee Feb 18 '19 at 05:23
  • If your FASTA file always contains the sequence data on a single line, `grep -A 1` and no `sed` would suffice. You could read all the identifiers at once with `grep -Ff ids.txt -A 1 source.fasta` and maybe throw in `-w` for good measure too (thanks @Cyrus for the suggestion in your deleted answer). – tripleee Feb 18 '19 at 05:33
  • We need you to post sample input which we can use to test a potential solution against to see if it produces the expected output you also posted but you posted sample input that produces no output so now all we can do is try to make up sample input ourselves to test the tool we wrote ourselves to see if it produces output we ourselves think might be what you want - a terrible way to try to test that the given tool actually behaves as **you** want it to. – Ed Morton Feb 18 '19 at 07:03
  • @EdMorton I have added the links to the files so you can look at them. I should have taken time to give suitable working ids and matching sequences but I wanted a solution since the file was too large – S.Chereddy Feb 18 '19 at 17:05
  • No, don't add links to files. Simply create a [mcve] and post it in your question. See [ask] if that's not clear. – Ed Morton Feb 18 '19 at 17:21

6 Answers6

2

Repeatedly looping over a large file is inefficient; you really want to avoid running grep (or sed or awk) more than once if you can avoid it. Generally speaking, sed and Awk will often easily allow you to specify actions for individual lines in a file, and then you run the script on the file just once.

For this particular problem, the standard Awk idiom with NR==FNR comes in handy. This is a mechanism which lets you read a number of keys into memory (concretely, when NR==FNR it means that you are processing the first input file, because the overall input line number is equal to the line number within this file) and then check if they are present in subsequent input files.

Recall that Awk reads one line at a time and performs all the actions whose conditions match. The conditions are a simple boolean, and the actions are a set of Awk commands within a pair of braces.

awk 'NR == FNR { s[$0]; next }
    # If we fall through to here, we have finished processing the first file.
    # If we see a wedge and p is 1, reset it -- this is a new sequence
    /^>/ && p { p = 0 }
    # If the prefix of this line is in s, we have found a sequence we want.
    ($1$2 in s) || ($1 in s) || ((substr($1, 1, 1) " " substr($1, 2)) in s) {
        if ($1 ~ /^>./) { print $1 } else { print $1 $2 }; p = 1; next }
    # If p is true, we want to print this line
    p' ids.txt source.fasta >out.fasta

So when we are reading ids.txt, the condition NR==FNR is true, and so we simply store each line in the array s. The next causes the rest of the Awk script to be skipped for this line.

On subsequent reads, when NR!=FNR, we use the variable p to control what to print. When we see a new sequence, we set p to 0 (in case it was 1 from a previous iteration). Then when we see a new sequence, we check if it is in s, and if so, we set p to one. The last line simply prints the line if p is not empty or zero. (An empty action is a shorthand for the action { print }.)

The slightly complex condition to check if $1 is in s might be too complicated -- I put in some normalizations to make sure that a space between the > and the sequence identifier is tolerated, regardless of there was one in ids.txt or not. This can probably be simplified if your files are consistently formatted.

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • Thanks for the quick reply. I tried this but the output i got is only for the last sequence id in my ids.txt file. out.fasta contains the last id and its sequence, it doesnt contain any of the other 99 sequences before it. – S.Chereddy Feb 18 '19 at 05:31
  • Do you have odd line endings in one or both of your files? If `cat -v` shows `^M` on some or all lines, you want to investigate how to get rid of those (and longer term, find an editor which does this correctly, or figure out if you can avoid using Windows to edit Unix text files). See also https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings – tripleee Feb 18 '19 at 05:35
  • as per [your previous question](https://stackoverflow.com/q/54698693/1745001) perhaps? – Ed Morton Feb 18 '19 at 07:07
1

Only with GNU grep and sed:

grep -A 1 -w -F -f ids.txt source.fasta | sed 's/ .*//'

See: man grep

Cyrus
  • 84,225
  • 14
  • 89
  • 153
  • Thanks for the reply. The output only gives one result `>TRINITY_DN6990_c0_g1_i1 TTTTTTTTTTTTTGTGGAAAAACATTGATTTTATTGAATTGTAAACTTAAAATTAGATTGGCTGCACATCTTAGATTTTGTTGAAAGCAGCAATATCAACAGACTGGACGAACAATGACACACATGATCTGCAGTTTGTTGATACCATATCCAACAGGTACAAGTTGCAATGAGAGCAGGTTTCTTCGATTTTTTGTCAGCATAAGCTTTCAGACGTAAGAAGGAATGAAAAAGAAATCTGACATTGACATTGCCATGAAATTGACTTTCAAAGAACGAATGAATTGAACTAATTTGAACGG`. This is the last id in the `ids.txt`. Could you tell me where the error is because the same thing happens for multiple scripts i.e. getting only the last id and its sequence. – S.Chereddy Feb 18 '19 at 06:08
  • Check both files for non printable characters: `cat -A file` – Cyrus Feb 18 '19 at 06:47
  • 1
    as per [your previous question](https://stackoverflow.com/q/54698693/1745001) perhaps? – Ed Morton Feb 18 '19 at 07:07
1
$ awk 'NR==FNR{a[$1];next} $1 in a{c=2} c&&c--' ids.txt source.fasta
>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG

The above was run using your posted source.fasta and this ids.txt:

$ cat ids.txt
>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN80_c0_g1_i1
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
0

First group all id's as one expression separated by | like this

cat ids.txt | tr '\n' '|' | awk "{print "\"" $0 "\""}'

Remove the last | symbol from the expression.

Now you can grep using the output you got from previous command like this

egrep -E ">TRINITY_DN14840_c10_g1_i1|>TRINITY_DN8506_c0_g1_i1|>TRINITY_DN12276_c0_g2_i1|>TRINITY_DN15434_c5_g3_i1|>TRINITY_DN9323_c8_g3_i5|>TRINITY_DN11957_c1_g7_i1|>TRINITY_DN15373_c1_g1_i1|>TRINITY_DN22913_c0_g1_i1|>TRINITY_DN13029_c4_g5_i1" source.fasta

This will print only the matching lines

Editing as per tripleee comments

Using the following is printing the output properly Assuming the ID and sequence are in different lined

tr '\n' '|' <ids.txt | sed 's/|$//' | grep -A 1 -E -f - source.fasta
Raghuram
  • 3,937
  • 2
  • 19
  • 25
  • I have one hundred ids that I am trying to pull, will i have to put each of them in the quotes? – S.Chereddy Feb 18 '19 at 05:35
  • That's the idea, `grep -E` should cope amicably with a few thousand. But this i9s error-prone if some of your sequence identifiers could contain characters which have a special regex interpretation (like `.`, `*`, or `[`). – tripleee Feb 18 '19 at 05:40
  • `egrep -E` is redundant; you want just `grep -E` or the legacy `egrep` without an explicit `-E`. – tripleee Feb 18 '19 at 05:41
  • This could be simplified to `tr '\n' '|' out.fasta` without any manual work involved. But see above comments for some complications. And switching to `grep -F` with each sequence identifier on a separate line would be both less error-prone and more efficient, as well as more elegant. – tripleee Feb 18 '19 at 05:42
  • The output for the script was `>TRINITY_DN6990_c0_g1_i1 len=887 path=[865:0-108 1861:109-132 975:133-270 1862:271-294 1137:295-363 1206:364-380 1223:381-886] [-1, 865, 1861, 975, 1862, 1137, 1206, 1223, -2]`. I did not get the sequence associated with it and again it gives only the last sequence id match. – S.Chereddy Feb 18 '19 at 05:55
  • And again, that sounds like your ID file contains DOS/Windows-specific control characters which prevent many Unix tools from operating correctly, because it is not a proper text file. Have you investigated with `cat -v`? Do you understand what people are trying to tell you is wrong? – tripleee Feb 18 '19 at 11:15
  • Yes, I did look into it and there is no line ending with ^M. I also tried to use dos2unix to convert it and then tried it but the same error occurred. I also checked the file type using file utility `ASCII text, with very long lines`. I used `vi:set list` and the output didnt show any weird characters either other than `$` for each line ending. – S.Chereddy Feb 18 '19 at 16:48
0

This might work for you (GNU sed):

sed 's#.*#/^&/{s/ .*//;N;p}#' idFile | sed -nf - fastafile

Convert the idFile into a sed script and run it against the fastaFile.

potong
  • 55,640
  • 6
  • 51
  • 83
  • This works fine as long as each record is always exactly two lines, but breaks down if the sequence is split over multiple lines. – tripleee Feb 18 '19 at 13:15
  • @tripleee from the limited example data it would appear that the second line of each record is variable length and there is no example of a record of more than two lines. – potong Feb 18 '19 at 14:27
  • That's certainly true. The FASTA format more generally allows for the sequences to be split over multiple lines, however. I'm pointing this out for the benefit of future visitors. – tripleee Feb 18 '19 at 14:35
0

the best way to do this is using either python or perl. I was able to make a script for extracting the ids using python as follows.

#script to extract sequences from a source file based on ids in another file
#the source is a fasta file with a header and a sequence that follows in one line
#the ids file contains one id per line
#both the id and source file should contain the character '>' at the beginning that siginifies an id

def main():

    #asks the user for the ids file 
    file1 = raw_input('ids file: ');
    #opens the ids file into the memory
    ids_file = open(file1, 'r');
    #asks the user for the fasta file
    file2 = raw_input('fasta file: ');
    #opens the fasta file into memory; you need your memory to be larger than the filesize, or python will hard crash
    fasta_file = open(file2, 'r');

    #ask the user for the file name of output file
    file3 = raw_input('enter the output filename: ');
    #opens output file with append option; append is must as you dont want to override the existing data
    output_file = open(file3, 'w');

    #split the ids into an array
    ids_lines = ids_file.read().splitlines()
    #split the fasta file into an array, the first element will be the id followed by the sequence
    fasta_lines = fasta_file.read().splitlines()

    #initializing loop counters
    i = 0;
    j = 0;

    #while loop to iterate over the length of the ids file as this is the limiter for the program
    while j<len(fasta_lines) and i<len(ids_lines):
            #if statement to match ids from both files and bring matching sequences
            if ids_lines[i] == fasta_lines[j]:
                #output statements including newline characters
                output_file.write(fasta_lines[j])
                output_file.write('\n')
                output_file.write(fasta_lines[j+1])
                output_file.write('\n')
                #increment i so that we go for the next id
                i=i+1;
                #deprecate j so we start all over for the new id
                j=0;
            else:
                #when there is no match check the id, we are skipping the sequence in the middle which is j+1
                j=j+2;

    ids_file.close()
    fasta_file.close()
    output_file.close()

main()`

The code is not perfect but works for any number of ids. I have tested for my samples which contained 5000 ids in one of them and the program worked fine. If there are improvements to the code please do so, I am a relatively newbie at programming so the code is a bit crude.

S.Chereddy
  • 29
  • 5