
I am using the following script to parse the Unimod XML file and load it into my database.

A few people asked about the input. It is a large file and I cannot paste all of it here; can you check http://www.unimod.org/xml/unimod.xml instead? If not, could you suggest somewhere I can paste it so that I can share it with you? I have tried to paste part of the input here:

GIST acetyl light PT and GIST acetyl light O-acetyl glyoxal-derived hydroimidazolone AA0048 RESID AA0049 RESID AA0041 RESID AA0052 RESID AA0364 RESID AA0056 RESID AA0046 RESID AA0051 RESID AA0045 RESID AA0354 RESID AA0044 RESID AA0043 RESID 11999733 PubMed PMID Chemical Reagents for Protein Modification 3rd edition, pp 215-221, Roger L. Lundblad, CRC Press, New York, N.Y., 2005 Book IonSource acetylation tutorial Misc. URL http://www.ionsource.com/Card/acetylation/acetylation.htm AA0055 RESID 14730666 PubMed PMID 15350136 PubMed PMID AA0047 RESID 12175151 PubMed PMID 11857757 PubMed PMID AA0042 RESID AA0050 RESID AA0053 RESID AA0054 RESID ACET FindMod PNAS 2006 103: 18574-18579 Journal http://dx.doi.org/10.1073/pnas.0608995103 MS/MS experiments of mass spectrometric c-ions (MS^3) can be used for protein identification by library searching. T3-sequencing is such a technique (see reference). Search engines must recognize this as a virtual modification. Top-Down sequencing c-type fragment ion AA0088 RESID AA0087 RESID AA0086 RESID AA0085 RESID AA0084 RESID AA0083 RESID AA0082 RESID AA0081 RESID AA0089 RESID AA0090 RESID AA0091 RESID AA0092 RESID AA0093 RESID AA0094 RESID AA0095 RESID AA0096 RESID AA0097 RESID AA0098 RESID AA0099 RESID AA0100 RESID AMID FindMod 14588022 PubMed PMID AA0117 RESID BIOT FindMod Carboxyamidomethylation 11510821 PubMed PMID 12422359 PubMed PMID Boja, E. S., Fales, H. M., Anal. Chem. 73 3576-82 (2001) Journal Creasy, D. M., Cottrell, J. S., Proteomics 2 1426-34 (2002) Journal 12203680 PubMed PMID Stark; Modification of proteins with cyanate. Meth Enz 25B, 579-584 (1972) Journal AA0343 RESID 10978403 PubMed PMID AA0332 RESID Smyth; Carbamylation of amino and tyrosine hydroxyl groups. J Biol Chem 242, 1579-1591 (1967) Journal IonSource carbamylation tutorial Misc. 
URL http://www.ionsource.com/Card/carbam/carbam.htm Carbamylation is an irreversible process of non-enzymatic modification of proteins by the breakdown products of urea isocyanic acid reacts with the N-term of a proteine or side chains of lysine and arginine residues Hydroxylethanone Carboxymethylation Protein which is post-translationally modified by the de-imination of one or more arginine residues; Peptidylarginine deiminase (PAD) converts protein bound to citrulline Convertion of glycosylated asparagine residues upon deglycosylation with PNGase F in H2O phenyllactyl from N-term Phe Citrullination FLAC FindMod AA0128 RESID CITR FindMod IonSource

I get this error

mismatched tag at line 13, column 3, byte 569 at /srv/myscr/script/../extern/cpan/lib/perl5/XML/Simple.pm line 391

The code that I used to parse the data is as follows. I would appreciate it if someone could tell me why I receive such an error and how to fix it.

After adding the code I get the following error

Fetching unimod.xml from unimod web site
Connecting to pipeline database
Emptying modifications table
Parsing XML
mismatched tag at line 13, column 3, byte 569 at /srv/myscr/script/../extern/cpan/lib/perl5/XML/Simple.pm line 39
nik
  • Would it be terribly difficult to change from `XML::Simple` to either of the standard and excellent [XML::LibXML](https://metacpan.org/pod/XML::LibXML) or [XML::Twig](https://metacpan.org/pod/XML::Twig)? `XML::Simple` has had its place and value but this was a long time ago. It is dismissed in its own docs and even its author wrote a [tutorial on `XML::LibXML`](https://grantm.github.io/perl-libxml-by-example/). It may require a little work upfront but now you are looking into the rabbit hole of debugging an obsolete module (regardless of whether the problem is about it or not). – zdim Jul 17 '18 at 21:49
  • @zdim if I switch to XML::LibXML, changing `use XML::Simple;` to `use XML::LibXML` and `my $xs = new XML::Simple` to `my $xs = new XML::LibXML`, unfortunately I get another error saying cannot locate object method "XMlin" via package "XML::LibXML" – nik Jul 17 '18 at 21:55
  • @zdim I get the same error when I use `XML::Twig` – nik Jul 17 '18 at 21:57
  • I can't look at all that code right now, hopefully someone else will. If that's the only change you need then please do it. Look at docs to see how to do with other modules what you have now. (Do they have that exact "_XMlin_" capability?) – zdim Jul 17 '18 at 21:57
  • @nik: `XML::LibXML` has a completely different API, and you will need to rewrite your program. Why don't you start with the tutorial that **zdim** linked? – Borodin Jul 18 '18 at 00:33
  • I think *"Use `XML::Simple` DOM parser - Okay as `unimod.xml` is small"* is a little hopeful. It's 44,000 lines of XML! `XML::Twig` is good at processing large XML data, and I suggest you go for that. I think its documentation is also a little better than that of `XML::LibXML`. – Borodin Jul 18 '18 at 00:47
  • The error message implies that there's an error in the XML data at line 13, but the version of the file on the internet is fine. Have you tried running your program without a parameter to use the online data instead? If you particularly want to use your offline files then please post a couple of dozen lines from it in your question here. – Borodin Jul 18 '18 at 00:54
  • This problem has nothing to do with your code. It's your input data that is the problem. And without seeing that input data, we can't be much help. The error message is pretty clear, there's a mismatched tag at line 13, column 3 of your XML document. – Dave Cross Jul 18 '18 at 06:40
  • That message indicates an error in the XML document. – ikegami Jul 18 '18 at 09:41
  • Tip: `$FindBin::Bin` doesn't always work. You should use `$FindBin::RealBin` instead. – ikegami Jul 18 '18 at 09:42
  • Your question's code is neither minimal (contains crazy amounts of unrelated code), nor sufficient (doesn't include the data needed to produce the error). Please fix your question. – ikegami Jul 18 '18 at 09:44
  • Tip: Avoid XML::Simple. Its own documentation discourages its use. You can read up on why XML::Simple is one of the hardest XML parsers to use [here](https://stackoverflow.com/questions/33267765/why-is-xmlsimple-discouraged). – ikegami Jul 18 '18 at 09:45
  • @Borodin the input is huge; check this link, which is the input: http://www.unimod.org/xml/unimod.xml – nik Jul 18 '18 at 14:37
  • @ikegami thanks for the tip you gave me. I really tried all the above ones, I could not figure it out why it cannot be parsed with Simple and when I change to LibXML everything changes – nik Jul 18 '18 at 14:39
  • Are you running the program without any arguments (so it pulls a new copy of the file from the web site) or are you giving it the name of a previously downloaded file? If it's the latter, then I suspect your local file has been corrupted in some way. – Dave Cross Jul 18 '18 at 14:56
  • Can you give us an extract (say, lines 10 to 15) from the file that you are parsing? I don't mean the copy on the web site, I mean the local copy. – Dave Cross Jul 18 '18 at 14:58
  • @Dave Cross No I use the one from the web. There is not any saved locally. I will try to paste 10 lines if this web allow me because it always says I use a lot of code ... – nik Jul 18 '18 at 15:04
  • @Dave Cross I pasted it above dave – nik Jul 18 '18 at 15:05
  • No. Don't bother, I can see the version on the web. I think you need to change the program so it writes the data that it gets from the web to a file and examine that file. – Dave Cross Jul 18 '18 at 15:06
  • @Dave Cross would it be possible to help me with this? – nik Jul 18 '18 at 15:07
  • Er... that input that you've just pasted isn't XML. Where did it come from? – Dave Cross Jul 18 '18 at 15:07
  • @Dave Cross It is xml , check here http://www.unimod.org/downloads.html then on the bottom of the web, you can see xml, I simply click on it and pasted it above . – nik Jul 18 '18 at 15:09
  • The sample that you have added to your question is not XML. Look at it. You can see it's not XML - it contains no tags. – Dave Cross Jul 18 '18 at 15:12
  • @Dave Cross I downloaded it from http://www.unimod.org/downloads.html . 3 years ago it used to work – nik Jul 18 '18 at 15:16
  • Re "*I could not figure it out why it cannot be parsed with Simple and when I change to LibXML everything changes*", hum? XML::Simple (or rather, the parser it's using, XML::Parser) should be able to parse any valid XML. The problem with XML::Simple isn't with its ability to parse XML. – ikegami Jul 18 '18 at 16:12
  • Re "*input is huge*", So? Isolate the problem. It should be easy since the parser gave you the exact line and column numbers where the problem occurs – ikegami Jul 18 '18 at 16:14
  • @ikegami I think Dave is somehow right about the XML fetching. I am now trying to understand how to download the xml file and parse it with above code rather than fetching it through . if you know anyone, please let me know – nik Jul 18 '18 at 16:15
  • @ikegami I cannot fetch the xml I guess , i tried to give 10 lines of the xml file above but it seems that it is not xml . – nik Jul 18 '18 at 16:16
  • @ikegami I downloaded the xml file into the folder where this script is. Do you know how to use the script above to parse that xml instead of fetching it? The name of the file is unimod.xml – nik Jul 18 '18 at 16:30
  • @nik: If you pass the program the name of an XML file, then it will parse that file instead of downloading it from the web site. – Dave Cross Jul 18 '18 at 18:33
  • @Dave Cross I tried a few things with no success. Would it be possible to help me amend the above code so that, instead of downloading the file, it takes it from the folder where the script is, with the name myfile.xml, and then parses it into the database? – nik Jul 18 '18 at 18:52
  • @nik: I just told you exactly how to do that. There's no need to change the code, it already supports what you want to do. – Dave Cross Jul 18 '18 at 19:02
  • @Dave Cross You mean I change this line? `my $response = $ua->get( "http://www.unimod.org/xml/unimod.xml" );` to this `my $response = $ua->get( "unimod.xml" );` because I saved the input in the same folder as the script – nik Jul 18 '18 at 19:04
  • @nik: No. There's no need to change the code at all. I'm sorry, but if you can't read the code or the conversation we've had here and work out what to do, then there's no point in helping you any further. – Dave Cross Jul 18 '18 at 19:06
  • @nik: I am not doing a PhD. My qualifications are BSc (hons) Computer Science & Cybernetics; PhD "Manipulating real-world objects in a virtual space" Reading 1980; PhD "Anticipating the missing pronoun" Kent 1982. It is you who are being disrespectful: I have never before seen such an example of neediness and requests for hand-holding as this question and your comments. *Stack Overflow* is a place for accomplished programmers to ask for help with issues that have eluded them. You are unable to use code and advice that is handed to you, and it is you who is wasting our time. – Borodin Jul 18 '18 at 20:29

1 Answer


For future reference, here is a cut-down version of your code which is just enough to demonstrate the problem. This is what you should have shown us as part of your original question.

use strict;
use warnings;

use XML::Simple;
use LWP::UserAgent;

print "Fetching unimod.xml from unimod web site\n";

# Retrieve latest xml version of Unimod from the website
my $ua = LWP::UserAgent->new();
$ua->env_proxy();
my $response = $ua->get( "http://www.unimod.org/xml/unimod.xml" );

my $xml = $response->content;

print "Parsing XML\n";

# Use XML::Simple DOM parser - Okay as unimod.xml is small
# Force specificity and neutral losses into an array to simplify code
my $xs = XML::Simple->new(
    KeyAttr    => { "umod:mod" => "+title" },
    ForceArray => [ "umod:specificity", "umod:NeutralLoss" ]
);
my $ref = $xs->XMLin( $xml );

See how I've removed all of the distractions about config files or updating the database. It just grabs the XML from the web site and parses it.

The bad news is that, for me, this works fine. It parses the XML without throwing any errors. For reference, I'm using XML::Simple version 2.25 and Perl 5.26.2.

It would be useful to know whether this program gives the same error as your original code when you run it.

As mentioned in a comment, it would also be interesting to see what XML you are actually getting from the web site. You can get that by taking the $xml variable and writing its contents to a file:

open my $xml_fh, '>', 'test.xml' or die $!;
print $xml_fh $xml;
close $xml_fh or die $!;

Then, once you have run the code, you will have a file called test.xml which contains the XML that the web site has given you. You can examine line 13 of that file to determine what the error is.
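To look at the area the parser complained about without opening the whole file, something like the following could be used. It is only a sketch: the helper name `lines_around` and the two-line context are assumptions made for illustration, and it uses nothing beyond core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of $path from ($line_no - $context)
# to ($line_no + $context), each prefixed with its 1-based line number.
sub lines_around {
    my ( $path, $line_no, $context ) = @_;
    $context = 2 unless defined $context;

    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @out;
    while ( my $line = <$fh> ) {
        next if $. < $line_no - $context;    # not yet in the window
        last if $. > $line_no + $context;    # past the window, stop reading
        chomp $line;
        push @out, sprintf( '%5d: %s', $., $line );
    }
    close $fh;
    return @out;
}

# Example: print lines 11 to 15 of test.xml, if that file exists.
if ( -e 'test.xml' ) {
    print "$_\n" for lines_around( 'test.xml', 13, 2 );
}
```

If the lines printed look like HTML (for example `<head>`, `<meta ...>` or `<br>`), the parser was never given the Unimod XML in the first place.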

For what it's worth, I suspect you're not getting XML back for some reason. I suspect that either the proxy on your network or the web site itself is blocking your attempts to pull down the data automatically and is returning a 404 or 503 HTML page to you. That's just a guess though and we won't know for sure unless you run the tests I've suggested above.
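One way to test that guess before handing anything to the parser is to check what came back. The sketch below is an assumption-laden illustration, not part of the original script: `diagnose_response` is a hypothetical helper, and the checks (status, content type, first characters of the body) are simple heuristics rather than real XML validation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an empty string if the response plausibly
# contains XML, otherwise a short description of what looks wrong.
sub diagnose_response {
    my ( $is_success, $status_line, $content_type, $body ) = @_;

    return "HTTP request failed: $status_line" unless $is_success;
    return 'empty response body' unless defined $body && length $body;

    # A proxy or web site can send an HTML error page with status 200.
    return "unexpected content type: $content_type"
        if defined $content_type && $content_type =~ m{text/html}i;

    # Skip an optional byte-order mark and leading whitespace, then look
    # at how the document starts.
    ( my $start = substr( $body, 0, 200 ) ) =~ s/^(?:\xEF\xBB\xBF|\x{FEFF})?\s*//;
    return 'body starts like an HTML page'
        if $start =~ /^<(?:!DOCTYPE\s+html|html)\b/i;
    return 'body does not start with a tag' unless $start =~ /^</;

    return '';
}

# With the LWP::UserAgent response from the code above, this could be called as:
#   my $problem = diagnose_response(
#       $response->is_success, $response->status_line,
#       scalar $response->header('Content-Type'), $response->decoded_content,
#   );
#   die "Not parsing: $problem\n" if $problem;
```

Failing early with a message such as "unexpected content type: text/html" is easier to act on than a "mismatched tag" error raised from inside the XML parser.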

Dave Cross
  • thanks Dave, I have a question, can you extend the code you gave me so that I save the xml file on my desktop without running it using my original or should i add it after `my $ref = $xs->XMLin( $xml );`? – nik Jul 18 '18 at 15:38
  • @nik: Add my new lines immediately after `$xml = $response->content;`. – Dave Cross Jul 18 '18 at 15:40
  • I added the print in my question – nik Jul 18 '18 at 15:44
  • @nik: Sorry, there was a typo in my code. It should be `print $xml_fh $xml;`, not `print $xml_fh, $xml;` (no comma). But your print output shows what the problem is. That's HTML you're getting, not XML. – Dave Cross Jul 18 '18 at 15:48
  • @nik: Also, you've accidentally removed your code from the question. – Dave Cross Jul 18 '18 at 15:49
  • when I change to `print $xml_fh $xml;` I get the same error I pasted above. so what I must do then ? do you have any idea? – nik Jul 18 '18 at 15:52
  • @nik: But printed to the file, rather than the console? – Dave Cross Jul 18 '18 at 15:53
  • Yes I see a file created text.xml there and when I use nano text.xml, it is empty ! – nik Jul 18 '18 at 15:55
  • @nik: Ok, that's weird. Did you copy my code exactly? – Dave Cross Jul 18 '18 at 15:57
  • @nik: But it doesn't really matter where the data is written. We know what the problem is now. You're not getting XML back from the web site - you're getting HTML. – Dave Cross Jul 18 '18 at 15:58
  • @Dave Cross This is what I copied: `$xml = $response->content; open my $xml_fh, '>', 'test.xml' or die $!; print $xml_fh $xml;` – nik Jul 18 '18 at 15:58
  • @nik: I can't help thinking you're rather missing the point here. It doesn't matter where the error goes, as long as you can read the error. – Dave Cross Jul 18 '18 at 16:00
  • I really don't want to waste your valuable time but it seems like you are very experienced. I have been struggling with this for long time I could not figure out how to fix it. I do appreciate your thought and your help a lot. – nik Jul 18 '18 at 16:03
  • I see your point. Is there any way to download the xml and then parse it to database with above script rather than fetching it? actually we do have a proxy and I gave a port which one can download anything . Let me know your thought please – nik Jul 18 '18 at 16:14
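The original script is no longer shown in the question, and according to the comments it already accepts a file name as an argument. For a script that does not, the idea of "use the saved copy if one is given, otherwise fetch" might be sketched as below. The helper name `load_xml_text` and the callback interface are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: return the XML text from a local file when a path is
# given, otherwise from the supplied fetch callback.
sub load_xml_text {
    my ( $path, $fetch ) = @_;

    if ( defined $path && length $path ) {
        open my $fh, '<:raw', $path or die "Cannot open $path: $!";
        local $/;            # slurp mode: read the whole file in one go
        my $text = <$fh>;
        close $fh;
        return $text;
    }
    return $fetch->();
}

# Usage sketch: run as `perl script.pl unimod.xml` to parse the saved copy,
# or with no argument to fetch from the web site.
#   my $xml = load_xml_text( $ARGV[0], sub {
#       my $response = $ua->get('http://www.unimod.org/xml/unimod.xml');
#       die $response->status_line unless $response->is_success;
#       return $response->content;
#   } );
```

The returned text can then be passed to `XMLin` exactly as `$xml` is in the cut-down program, so the parsing and database code does not need to know where the data came from.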