I would like some advice on working around an XML parsing error. My BLAST XML output contains a description with an unescaped '&' character, which breaks the SearchIO.parse function.

If I run

from Bio import SearchIO

qresults = SearchIO.parse(PLAST_output, "blast-xml")

for record in qresults:
    pass  # do some stuff

I get the following error:

cElementTree.ParseError: not well-formed (invalid token): line 13701986, column 30

Which points me to this line:

<Hit_def>Lysosomal & prostatic acid phosphatases [Xanthophyllomyces dendrorhous</Hit_def>

Is there a way to override this in Biopython so I do not have to change my XML file? Right now, I'm just wrapping the loop in a try/except block, but that is not optimal!

Thanks for your help! Courtney

  • https://stackoverflow.com/questions/1328538/how-do-i-escape-ampersands-in-xml-so-they-are-rendered-as-entities-in-html –  Jul 05 '18 at 16:58
  • Thanks Will - this is a possible solution. However, BLAST XML files are rather large (>1GB) and I would rather not edit them prior to import. Ideally, I want to build a pipeline in which I do not have to modify the output file from another piece of software. Can you think of another way? Thanks :)! – Courtney Stairs Jul 05 '18 at 17:47
  • Whoever generates them needs to generate valid xml. If that's not possible, you're going to have to do it. There's no magic solution here, unless you find a library that provides you a way (e.g., callbacks) to fix invalid xml. –  Jul 05 '18 at 19:57
  • Cool. Thanks. It would be really great if someone from Biopython could weigh in on this, since the parser works until it hits the offending character; maybe they have a callback library already? I just find it hard to believe that I am the first one to discover this issue -- thousands of people use BLAST routinely! Perhaps I should contact the BLAST developers and warn them their XML files are weird. For now, I've just 'sed'ed the offending characters out and everything is working. – Courtney Stairs Jul 06 '18 at 08:39
  • How did you generate your BLAST xml file? I just tried BLASTing XM_024724698.1 which has an & sign in its name and the character was properly escaped when downloading the XML from the NCBI page. – Maximilian Peters Jul 06 '18 at 17:05
  • I am generating it via the standalone package on my local machine. I did lie a bit when I said I was using BLAST -- it's actually an alternative package called 'PLAST', which SAYS it produces a BLAST XML format; that was apparently also a lie. PLAST runs faster than BLAST and is more sensitive than other fast software like the DIAMOND aligner. However, the more I look into this faster alternative, the more I think it is not ideal (or at least its XML output is not). Thanks for your help. I'll investigate alternative solutions :) – Courtney Stairs Jul 09 '18 at 08:05
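
One way to avoid both editing the >1GB file and the 'sed' step discussed above might be to escape the bare ampersands on the fly, while the parser reads the file. The following is only a sketch under some assumptions: the names BARE_AMP, escape_bare_ampersands and AmpersandEscapingReader are made up for illustration, the file is read in text mode, and the parser is assumed to consume its input only through read(size) (which is how xml.etree.ElementTree.iterparse uses a file-like object). Lines are escaped whole, so an existing entity such as &amp; is never cut in half before the check. Text like "AT&T;" would wrongly be left alone because it looks like an entity reference.

```python
import io
import re
import xml.etree.ElementTree as ET

# An '&' that does NOT start a named, decimal, or hexadecimal entity reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace bare '&' with '&amp;' and leave existing entities untouched."""
    return BARE_AMP.sub("&amp;", text)


class AmpersandEscapingReader:
    """Hypothetical file-like wrapper (not part of Biopython).

    Wraps a text-mode handle and escapes bare ampersands line by line
    as the data is read, so the file on disk is never modified.
    """

    def __init__(self, handle):
        self._lines = iter(handle)
        self._buffer = ""

    def read(self, size=-1):
        # Pull whole lines until enough escaped text is buffered (or EOF).
        while size < 0 or len(self._buffer) < size:
            line = next(self._lines, None)
            if line is None:
                break
            self._buffer += escape_bare_ampersands(line)
        if size < 0:
            chunk, self._buffer = self._buffer, ""
        else:
            chunk, self._buffer = self._buffer[:size], self._buffer[size:]
        return chunk


# Small stand-in for the offending part of the XML file.
sample = io.StringIO(
    "<Hit>\n"
    "<Hit_def>Lysosomal & prostatic acid phosphatases</Hit_def>\n"
    "<Hit_accession>A&amp;B</Hit_accession>\n"
    "</Hit>\n"
)

hit_defs = [
    elem.text
    for _, elem in ET.iterparse(AmpersandEscapingReader(sample))
    if elem.tag == "Hit_def"
]
print(hit_defs)
```

With Biopython, the same wrapper could presumably be passed in place of the file name, e.g. SearchIO.parse(AmpersandEscapingReader(open(PLAST_output, encoding="utf-8")), "blast-xml"). That part is untested here and assumes SearchIO accepts any object with a read() method as a handle. Reported line and column numbers in later parse errors will refer to the escaped stream, so columns may be shifted relative to the file on disk.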
