
I am using the Saxon processor to transform a large XML file (7,000+ lines) into an RSS 2.0 XML file.

I have no control over the input XML files. They are pulled from a server, and my XSL file is supposed to transform them into an RSS feed.

Occasionally the input XML file contains an element with an href attribute like this:

  <A href="https://www.google.com/maps/preview?q=tehran+iran&ie=UTF-8&hq=&hnear=0x3f8e00491ff3dcd9:0xf0b3697c567024bc,Tehran,+Iran&gl=us&ei=24iMU-jvFNLNsQTwi4DgAQ&ved=0CKsBELYDMBQ&source=newuser-ws">(map)</A>

The Saxon processor rejects part of this string, though. Here is the error message:

Error on line 837 column 62 of production.xml: SXXP0003: Error reported by XML parser: The reference to entity "ie" must end with the ';' delimiter. org.xml.sax.SAXParseException; systemId: file:/C:/XSLT/Test3/production.xml; lineNumber: 837; columnNumber: 62; The reference to entity "ie" must end with the ';' delimiter.

Based on the error, it appears the processor is treating the ie parameter in the URL string as an XML entity reference.

Is there anything I could add to the RSS 2.0 XSL stylesheet that would tell the Saxon processor to skip lines like these? I do not actually need the information from <A>:

  <A href="https://www.google.com/maps/preview?q=tehran+iran&ie=UTF-8&hq=&hnear=0x3f8e00491ff3dcd9:0xf0b3697c567024bc,Tehran,+Iran&gl=us&ei=24iMU-jvFNLNsQTwi4DgAQ&ved=0CKsBELYDMBQ&source=newuser-ws">(map)</A>

If I could skip lines like these entirely, and that resolved the error, that would be great. Alternatively, if this looks like a bug in the Saxon processor and another processor would not have the problem, I would welcome a recommendation for a more appropriate one.

John Saunders
Kyle Bridenstine
  • I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". – John Saunders Dec 20 '14 at 00:25
  • There is no problem with Saxon, what you have is a badly constructed xml document. From http://www.w3.org/TR/REC-xml/#syntax "The ampersand character (&) and the left angle bracket (<) must not appear in their literal form" – Rnet Dec 20 '14 at 08:16
  • Two points. First, it's not Saxon that's complaining, it's the XML parser underneath. Secondly, it's right to complain. Your file is not XML. Trying to parse invalid XML is like trying to compile an incorrect Java program, the best you can hope for is good error messages. – Michael Kay Dec 20 '14 at 11:17

1 Answer


The input is not well-formed XML: a literal & must be escaped. You can correct your input by replacing every unescaped & with &amp; before passing the file to the processor.

The other characters that have predefined entities, and that you may need to escape depending on where they appear in your XML, are:

" with &quot;,

' with &apos;,

< with &lt;, and

> with &gt;
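Since the input files cannot be fixed at the source, the ampersand replacement described above could be done in a small pre-processing step before Saxon runs. The following is only a sketch under stated assumptions: the function name `escape_bare_ampersands` and the sample URL are made up for illustration, the pattern treats only the five predefined entities and numeric character references as already escaped, and it does not special-case CDATA sections or entities declared in a DTD.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: only the five predefined entities and numeric character
# references count as "already escaped". Entities declared in a DTD and
# ampersands inside CDATA sections are not handled by this sketch.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace every '&' that does not start a reference with '&amp;'."""
    return BARE_AMP.sub("&amp;", text)


# Hypothetical sample modelled on the failing line from the question.
broken = '<A href="https://www.google.com/maps/preview?q=tehran+iran&ie=UTF-8&hq=">(map)</A>'
fixed = escape_bare_ampersands(broken)

# The repaired string now parses, and the attribute value holds the
# original URL with its literal ampersands restored by the parser.
element = ET.fromstring(fixed)
href = element.get("href")
```

In practice you would read the downloaded file as text, apply `escape_bare_ampersands` to its contents, write the result to a new file, and point Saxon at that file. Ampersands that are already written as `&amp;` are left alone, so running the step twice does not double-escape them.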

Lingamurthy CS
  • Not quite. `>` does not have to be escaped. The quote - single or double - is only problematic within attributes, **if** it conflicts with the surrounding quotes. http://www.w3.org/TR/REC-xml/#syntax – michael.hor257k Dec 20 '14 at 05:29
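
The points in this comment can be checked with any conforming XML parser. Here is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample snippets are invented for illustration and do not come from the original input file.

```python
import xml.etree.ElementTree as ET


def is_well_formed(snippet: str) -> bool:
    """Return True if the parser accepts the snippet, False otherwise."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False


# A literal '>' in element content is accepted.
gt_in_text = is_well_formed("<a>1 > 0</a>")

# A quote is accepted when it differs from the attribute's delimiters.
apostrophe_in_double_quotes = is_well_formed('<a b="it\'s">x</a>')
double_quote_in_single_quotes = is_well_formed("<a b='say \"hi\"'>x</a>")

# A quote that matches the delimiters ends the attribute value early.
conflicting_quote = is_well_formed('<a b="say "hi"">x</a>')

# Literal '<' and '&' are rejected in element content.
lt_in_text = is_well_formed("<a>1 < 0</a>")
amp_in_text = is_well_formed("<a>x & y</a>")
```

Only the last three snippets are rejected, which is why escaping `&` and `<` is mandatory, while quotes need attention only inside an attribute value delimited by the same quote character.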