So here is the issue. I am trying to parse an XML file of information from GenBank. The file contains information on multiple DNA sequences. I have already done this for two other XML formats from GenBank (TINY xml and INSD xml), but pure XML is giving me a headache.

Here's how my program should work. Download an XML-formatted file that contains information on X number of sequences from GenBank. Run my Perl script, which searches through that XML file line by line and prints the information I want to a new file in FASTA format, which looks like this: >Sequence_name_and_information\n sequences\n >sequence_name.... and so on until all the sequences from the XML file have been written.

My issue is that in pure XML the sequence itself comes before the identifier for the gene or locus of the sequence, and the gene or locus should go on the same line as the ">". Here is the code I have, starting from the point where I open the output file and parse through the input:
    open( New_File, '+>', "$PWD_file/$new_file" )
      or die "\n\nCouldn't create file ($!). Check permissions on location.\n\n";

    while ( my $lines = <INSD> ) {
        if ( $lines =~ m/<INSDSeq_locus>.*<\/INSDSeq_locus>/ ) {
            # Locus line: strip the tags and start the FASTA header.
            $lines =~ s/<\/?INSDSeq_locus>//g;
            # Removes lowercase letters, commas, spaces and pipes. This may
            # cause a bug of removing the letters in the GenBank accession number.
            $lines =~ s/[a-z, |]//g;
            chomp($lines);
            print New_File ">$lines\_";
        }
        elsif ( $lines =~ m/<INSDSeq_organism>.*<\/INSDSeq_organism>/ ) {
            # Organism line: strip the tags, normalise to underscores,
            # and finish the header line.
            $lines =~ s/<\/?INSDSeq_organism>//g;
            $lines =~ s/(\.|\?|\-| )/_/g;
            $lines =~ s/_{2,}/_/g;
            $lines =~ s/_{1,}$//;
            $lines =~ s/^>*_{1,}//;
            $lines =~ s/\s{2}//g;
            chomp($lines);
            print New_File "$lines\n";
        }
        elsif ( $lines =~ m/<INSDSeq_sequence>.*<\/INSDSeq_sequence>/ ) {
            # Sequence line: strip the tags and spaces, print on its own line.
            $lines =~ s/<\/?INSDSeq_sequence>//g;
            $lines =~ s/ //g;
            chomp($lines);
            print New_File "$lines\n";
        }
    }
    close INSD;
    close New_File;
}    # closes a block opened earlier in the script (not shown)
There are two places to find the gene/locus information. It is found between one of these two tags: LOCUS_NAME or GENE_NAME. Only one of them will be filled in; if one has info, the other will be empty. Whichever one has the info, it needs to be added to the end of the >....... line.
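
A minimal sketch of one way to handle this ordering: hold the sequence in a variable until a non-empty LOCUS_NAME or GENE_NAME turns up, then write the header and the sequence together. The <SEQUENCE> tag name, the records_to_fasta helper and the demo data are assumptions made for illustration, not taken from the real pure XML file, and the real header would also carry the name and organism information.

```perl
use strict;
use warnings;

# Hypothetical helper: turns lines of XML into FASTA text.
# <SEQUENCE> is an assumed tag name; LOCUS_NAME and GENE_NAME are the
# two tags described above, of which only one is expected to be filled.
sub records_to_fasta {
    my (@xml_lines) = @_;
    my $fasta = '';
    my $sequence;    # held back until the gene/locus name is seen

    for my $line (@xml_lines) {
        if ( $line =~ m{<SEQUENCE>\s*(.*?)\s*</SEQUENCE>} ) {
            # The sequence comes first, so remember it instead of printing.
            $sequence = $1;
        }
        elsif ( $line =~ m{<(?:LOCUS_NAME|GENE_NAME)>\s*([^<]*?)\s*</(?:LOCUS_NAME|GENE_NAME)>} ) {
            my $name = $1;

            # Skip the empty tag, and skip names with no pending sequence.
            next if $name eq '' or !defined $sequence;

            $fasta .= ">$name\n$sequence\n";
            undef $sequence;    # ready for the next record
        }
    }
    return $fasta;
}

# Made-up demo input, one tag per line.
my @demo = (
    "<SEQUENCE>ACGTACGT</SEQUENCE>\n",
    "<LOCUS_NAME></LOCUS_NAME>\n",
    "<GENE_NAME>rbcL</GENE_NAME>\n",
    "<SEQUENCE>TTGACA</SEQUENCE>\n",
    "<LOCUS_NAME>ITS1</LOCUS_NAME>\n",
    "<GENE_NAME></GENE_NAME>\n",
);
print records_to_fasta(@demo);
```

Keeping the sequence in a scalar avoids the temporary file, as long as a single sequence fits comfortably in memory.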
Thanks,
AlphaA
PS: I tried writing the sequence to a temporary "file" by opening "$NA" with ">" and printing the sequence to it, then moving on with the program, finding the gene info, printing it to the > line, and finally reading the $NA file back and printing its contents on the line right after the > line. I hope this is clear.