-3

I am new to scripting. Could you help me find a way to separate sequences based on the information in the header? For example, I have a FASTA file like this:

ERR1897927.533;barcodelabel=R40_1193R_F61_799F;
GTAGTCCTAGCCCTAAACGATGGATACTTGGTGTGACTGGGATTGAATCCAGTCGTGCCG
AAGCTAACGCATTAAGTATCCCGCCTGGGGAGTACGGTCGCAAGGCTGAAACTCAAAGGA
ATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAA
CCTTACCAGCGTTTGACATGGTAGGACGGTTTCCAGAGATGGATTCCTCCCCTTACGGGG
CCTACACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTC
CCGCAACGAGCGCAACCCTCGTCTTTAGTTGCCACCATTTAGTTGGGCACTCTAAAGAAA
ERR1897927.925;barcodelabel=R41_1193R_F62_799F;

Now I would like to split the sequences into separate FASTA files based on "barcodelabel" (based on the header only, not on the sequence itself, as I have already removed the barcodes).

Please let me know how to do this.

Many thanks in advance,

Best! Wasim

Wasim

1 Answer

2

@Wasim welcome to Stack Overflow. For bioinformatics-related questions it is better to use the Bioinformatics site. I have written a Python script that solves your problem on the example file given below:

ERR1897927.533;barcodelabel=R40_1193R_F61_799F; 
GTAGTCCTAGCCCTAAACGATGGATACTTGGTGTGACTGGGATTGAATCCAGTCGTGCCGAAGCTAACGCATTAAGTATCCCGCCTGGGGAGTACGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAA CCTTACCAGCGTTTGACATGGTAGGACGGTTTCCAGAGATGGATTCCTCCCCTTACGGGGCCTACACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTC CCGCAACGAGCGCAACCCTCGTCTTTAGTTGCCACCATTTAGTTGGGCACTCTAAAGAAA
ERR1897927.925;barcodelabel=R41_1193R_F62_799F;
GTAGTCCTAGCCCTAAACGATGGATACTTGGTGTGACTGGGATTGAATCCAGTCGTGCCGAAGCTAACGCATTAAGTATCCCGCCT  GGGGAGTACGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAAC    GCGCAGAA   CCTTACCAGCGTTTGACATGGTAGGACGGTTTCCAGAGATGGATTCCTCCCCTTACGGGGCCTACACACAGGTGCTGCATGGCTGT  CGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTC   CCGCAACGAGCGCAACCCTCGTCTTTAGTTGCCACCATTTAGTTGGGCACTCTAAAGAAA
ERR1897927.925;barcodelabel=R42_1193R_F62_799F;
GTAGTCCTAGCCCTAAACGATGGATACTTGGTGTGACTGGGATTGAATCCAGTCGTGCCG AAGCTAACGCATTAAGTATCCCGCCTGGGGAGTACGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAA CCTTACCAGCGTTTGACATGGTAGGACGGTTTCCAGAGATGGATTCCTCCCCTTACGGGGCCTACACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTC  CCGCAACGAGCGCAACCCTCGTCTTTAGTTGCCACCATTTAGTTGGGCACTCTAAAGAAA
ERR1897927.925;barcodelabel=R43_1193R_F62_799F;
GTAGTCCTAGCCCTAAACGATGGATACTTGGTGTGACTGGGATTGAATCCAGTCGTGCCG    AAGCTAACGCATTAAGTATCCCGCCTGGGGAGTACGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAA CCTTACCAGCGTTTGACATGGTAGGACGGTTTCCAGAGATGGATTCCTCCCCTTACGGGGCCTACACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTC CCGCAACGAGCGCAACCCTCGTCTTTAGTTGCCACCATTTAGTTGGGCACTCTAAAGAAA
ERR1897927.925;barcodelabel=R44_1193R_F62_799F;
GTAGTCCTAGCCCTAAACGATGGATACTTGGTGTGACTGGGATTGAATCCAGTCGTGCCG AAGCTAACGCATTAAGTATCCCGCCTGGGGAGTACGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAA  CCTTACCAGCGTTTGACATGGTAGGACGGTTTCCAGAGATGGATTCCTCCCCTTACGGGGCCTACACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTC CCGCAACGAGCGCAACCCTCGTCTTTAGTTGCCACCATTTAGTTGGGCACTCTAAAGAAA

The script is:

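#!/usr/bin/python3
import re

# Read the whole input file into memory.
with open("fasta_file", "r") as fasta_file:
    chk = fasta_file.read()

# Split on header lines such as "ERR1897927.533;barcodelabel=R40_1193R_F61_799F;"
# (an optional leading ">" is accepted as well).
k2 = re.split(r'>?ERR\d+\.\d+;barcodelabel=[^;]*;', chk)
# Remove the newlines inside each record's sequence.
line = [i.replace('\n', '') for i in k2]
del line[0]  # drop whatever precedes the first header

# Write each sequence to its own numbered file: file1.txt, file2.txt, ...
for i, name in enumerate(line):
    with open("file" + str(i + 1) + ".txt", "w") as f:
        f.write(name + "\n")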

This writes one output file per record (file1.txt, file2.txt, ...), each containing the sequence that follows a barcodelabel header.
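If the goal is one FASTA file per distinct barcodelabel, with all records that share a label kept together along with their headers, a rough sketch could look like the following. It is not the script from the answer. It assumes every header sits on its own line in the form "<read id>;barcodelabel=<label>;" (with or without a leading ">"), and the function names and the "<label>.fasta" output naming are made up for illustration.

```python
import os
import re
from collections import defaultdict

# Assumed header layout: "<read id>;barcodelabel=<label>;", optionally prefixed with ">".
HEADER_RE = re.compile(r'^>?(\S+?;barcodelabel=([^;\s]+);)\s*$')


def group_by_barcode(text):
    """Return {barcodelabel: [(header, sequence), ...]} for FASTA-like text."""
    groups = defaultdict(list)
    header, label, chunks = None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADER_RE.match(line)
        if match:
            # A new header closes the previous record, if there was one.
            if header is not None:
                groups[label].append((header, "".join(chunks)))
            header, label, chunks = match.group(1), match.group(2), []
        elif header is not None:
            # Sequence line: drop any internal whitespace before joining.
            chunks.append("".join(line.split()))
    if header is not None:
        groups[label].append((header, "".join(chunks)))
    return dict(groups)


def write_groups(groups, out_dir="."):
    """Write one file per barcodelabel, named <label>.fasta (hypothetical naming)."""
    paths = []
    for label, records in groups.items():
        path = os.path.join(out_dir, label + ".fasta")
        with open(path, "w") as handle:
            for header, seq in records:
                handle.write(">" + header + "\n" + seq + "\n")
        paths.append(path)
    return paths
```

With the example file above, reading it into a string and passing it through group_by_barcode and then write_groups should produce files such as R40_1193R_F61_799F.fasta and R41_1193R_F62_799F.fasta, one per label. Everything is held in memory, so for very large files a streaming approach may be a better fit.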

  • @wasim I don't understand what you are asking. Could you please edit your question and add more details? – Ammar Sabir Cheema Sep 27 '18 at 12:07
  • Here is my update: Thank you @Ammar! Before performing demultiplexing, could you please let me know how I can add sample names (Root40, Soil51 etc.) at the start of the sequence headers (according to barcodelabel)? For example, I have sequence headers like >ERR1897927.533;barcodelabel=R40_1193R_F61_799F; GTAGTCC..... >ERR1897927.925;barcodelabel=R41_1193R_F62_799F; GTAGTCC... and I want output like this: >Root41;ERR1897927.533;barcodelabel=R40_1193R_F61_799F; GTAGTCC..... >Soil51;ERR1897927.925;barcodelabel=R41_1193R_F62_799F; GTAGTCC... Thank you! – Wasim Sep 27 '18 at 15:07
  • @Wasim I think it's better to stick to your original question and post your addition as a separate question, so that it can be more detailed and closer to the code of conduct. – Ammar Sabir Cheema Sep 27 '18 at 17:18