1

So here is the issue. I am trying to parse a XML file of information from GenBank. This file contains information on multiple DNA sequences. I have this done already for two other xml formats from genbacnk (TINY xml and INSD xml), but pure xml gives me a headache. Here's how my program should work. Download an xml formated file that contains information on X number of sequences from GenBank. Run my perl script that searches through that xml file line by line and prints the information I want to a new file, in fasta format. Which is this: >Sequence_name_and_information\n sequences\n >sequence_name.... and on and on until you have all the sequences from the xml file. My issue though is that in pure xml the sequence itself comes before the identifier for the gene or locus of the sequences. The gene or locus of the sequences should go in the same line as the ">". Here is the code I have from the point of opening the file and parsing through it:

open( New_File, "+>$PWD_file/$new_file" ) or die "\n\nCouldn't create file. Check permissions on location.\n\n";

    while ( my $lines = <INSD> ) {
        foreach ($lines) {
            if (m/<INSDSeq_locus>.*<\/INSDSeq_locus>/) {
                $lines =~ s/<INSDSeq_locus>//g and $lines =~ s/<\/INSDSeq_locus>//g and $lines =~ s/[a-z, |]//g; #this last bit may cause a bug of removing the letters in the genbank accession number
                $lines =~ s/ //g;
                chomp($lines);
                print New_File ">$lines\_";
            } elsif (m/<INSDSeq_organism>.*<\/INSDSeq_organism>/) {
                $lines =~ s/<INSDSeq_organism>//g and $lines =~ s/<\/INSDSeq_organism>//g;
                $lines =~ s/(\.|\?|\-| )/_/g;
                $lines =~ s/_{2,}/_/g;
                $lines =~ s/_{1,}$//;
                $lines =~ s/^>*_{1,}//; 
                $lines =~ s/\s{2}//g;
                chomp($lines);
                print New_File "$lines\n";
            } elsif (m/<INSDSeq_sequence>.*<\/INSDSeq_sequence>/) {
                $lines =~ s/<INSDSeq_sequence>//g and $lines =~ s/<\/INSDSeq_sequence>//g;
                $lines =~ s/ //g;
                chomp($lines);
                print New_File "$lines\n";
            }
        }
    }
    close INSD;
    close New_File;
}

There are two places to find Gene/locus information. That info is found between either on of these two tags: LOCUS_NAME or GENE_NAME. There will be one, or the other. If one has info the other will be empty. In either case both need to be added to the end of the >....... line.

Thanks,

AlphaA

PS--I tried to print that info to a "file" by doing open "$NA", ">" the sequence to that, then moving on with the program, finding the gene info, printing it to the > line and then read the $NA file and printing it to the line right after the > line. I hope this is clear.

runrig
  • 6,486
  • 2
  • 27
  • 44
AlphaA
  • 113
  • 2
  • 9
  • 12
    Is there a reason you're choosing not to use an XML parsing library? – Brian Roach Oct 11 '11 at 20:12
  • Maybe an input, current output and real output examples would be nice. – DavidEG Oct 11 '11 at 20:21
  • @DavidEG: Well, OP specified GenBank, so http://www.ncbi.nlm.nih.gov/genbank/ and ftp://ftp.ncbi.nih.gov/genbank/ – derobert Oct 11 '11 at 20:28
  • @Brian Roach- two reasons, trying to learn perl and get practice, and this program is for a bunch of us in a molecular biology lab, not everyone has all the libraries and I can't take the time to get everyone who wants to use it set up with all the libraries. to DavidEG, what do you mean by current output and real output? Please keep in mind I am a molecular biologist dabbling in perl for my needs. here is the input files: http://www.ncbi.nlm.nih.gov/nuccore go there, search for boletus, select a few sequences, go to send to (top right) chose, file, xml. – AlphaA Oct 11 '11 at 20:30
  • 4
    @AlphaA: Several of the XML parsing libraries are so common I'd be surprised if you have a Perl install without them. Some are also pure perl (no XS) so you could actually put them together with your script in a tarball. – derobert Oct 11 '11 at 20:37
  • Please read the first answer at http://stackoverflow.com/questions/1732348 to understand why parsing XML with regular expressions is a "Really Bad Idea". Then, for more sane explanations just search SO for the tags [xml][regex] (Yes, I know the answer is more about XHTML, but it applies equally to XML) – Jim Garrison Oct 11 '11 at 21:32

2 Answers2

4

In my opinion you should use XSLT with XPath to navigate to the data you need.

As @Brian suggests, it is easier to use established XML parsing techniques and libraries.

There is even a Perl library for XSLT

Community
  • 1
  • 1
Heinrich Filter
  • 5,760
  • 1
  • 33
  • 34
4

Use an XML parser. I'm not a biologist, and I'm not sure of the final format you want, but it should be simple with this as a starting point. $_[1] in the anonymous sub contains a hash reference with, from what I can tell above, everything that I think you want saved from parsing the parent tag of the tags you want. It should be easy to print out the elements of $_[1] in the format that you want it to be in:

use strict;
use warnings;

use XML::Rules;
use Data::Dumper;

my @rules = (
  _default => '',
  'INSDSeq_locus,INSDSeq_organism,INSDSeq_sequence' => 'content',
  INSDSeq  => sub { delete $_[1]{_content}; print Dumper $_[1]; return },
);

my $p = XML::Rules->new(rules => \@rules);
$p->parsefile('sequence.gbc.xml');

And that is just so that printing just the tags you want is easy. Or, if you want some other tags, What I really might do is this (you don't really need the @tags variable at all if you're just printing element by element):

my @tags = qw(
  INSDSeq_locus
  INSDSeq_organism
  INSDSeq_sequence
);

my @rules = (
  _default => 'content',
  # Elements are, e.g. $_[1]{INSDSeq_locus}
  INSDSeq  => sub { print "$_: $_[1]{$_}\n" for @tags; return; },
);

with:

my $p = XML::Rules->new(rules => \@rules, stripspaces => 4);
runrig
  • 6,486
  • 2
  • 27
  • 44
  • No need for the `@tags` array and the map. You can specify several tag names in a string literal and separate them by a comma: my @rules = ( 'INSDSeq_locus,INSDSeq_organism,INSDSeq_sequence' => 'content', ... – Jenda Oct 01 '12 at 22:24