I want to create an XML file that stores the structure of a Java program. I can parse the Java program and create the required tags. The problem arises when I try to include the source code inside my tags: Java source code may contain many characters that are reserved in XML, such as &, < and >, so I am not able to create valid XML.

My XML should go like this:

<?xml version="1.0"?>
<prg name="prg_name">
  <class name="class_name">
    <parent>parent class</parent>
      <interface>Interface name</interface>
.
.
.
      <method name= "method_name">
        <statement>the ordinary java statement</statement>
        <if condition="Conditional Expression">
          <statement> true statements </statement>
        </if>
        <else>
          <statement> false statements </statement>
        </else>
        <statement> usual control statements </statement>
 .
 .
 .
      </method>
    </class>
 .
 .
 .
 </prg>

The problem is that the conditional expressions of if statements, and other statements, contain many & and other reserved symbols, which prevent the XML from validating. Since all this data (source code) is supplied by the user, I have little control over it. Escaping the characters will be very costly in terms of time.

I can use CDATA to escape element text, but it cannot be used for attribute values, which is where the conditional expressions go. I am using the ANTLR Java grammar to parse the Java program and to obtain the attributes and content for the tags. Is there any other workaround?

Jens
Sudh

2 Answers

You will have to escape

" to  &quot;
' to  &apos;
< to  &lt;
> to  &gt;
& to  &amp;

in XML.
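
As a rough illustration of the table above, here is a minimal Java sketch that replaces those five characters in a single pass over the input. The class name `XmlText` and the method name `escape` are made up for this example and do not come from ANTLR or the JDK. Since it visits each character once, the cost grows linearly with the length of the source text, which may be cheaper than the question assumes.

```java
// Hypothetical helper, not part of any library: escapes the five
// characters listed above using the predefined XML entities.
final class XmlText {
    private XmlText() {}

    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A condition as it might appear in user-supplied Java source.
        String condition = "a < b && s.equals(\"x\")";
        System.out.println("<if condition=\"" + escape(condition) + "\"/>");
        // prints: <if condition="a &lt; b &amp;&amp; s.equals(&quot;x&quot;)"/>
    }
}
```

The same method can be applied to element text as well as attribute values, so CDATA is not needed for the `<statement>` content either.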

Bala R

In XML attributes you must escape

" with &quot;
< with &lt;
& with &amp;

if you wrap attribute values in double quotes ("), e.g.

<MyTag attr="If a&lt;b &amp; b&lt;c then a&lt;c, it's obvious"/>

This declares a tag MyTag whose attribute attr has the value If a<b & b<c then a<c, it's obvious. Note that there is no need to use &apos; to escape the ' character here.

If you wrap attribute values in single quotes (') then you should escape these characters:

' with &apos;
< with &lt;
& with &amp;

and you can write " as is. Escaping of > with &gt; in attribute text is not required, e.g. <a b=">"/> is well-formed XML.
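
The minimal rules above could be sketched in Java roughly as follows. The names `XmlAttr`, `escapeAttribute` and `attribute` are hypothetical, chosen only for this example. The sketch covers just the three characters that matter for the chosen quote style and does not handle other concerns, such as control characters that XML does not allow.

```java
// Hypothetical helper, not part of any library: applies the minimal
// attribute escapes for the quote character that wraps the value.
final class XmlAttr {
    private XmlAttr() {}

    static String escapeAttribute(String value, char quote) {
        if (quote != '"' && quote != '\'') {
            throw new IllegalArgumentException("quote must be \" or '");
        }
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '<') {
                out.append("&lt;");
            } else if (c == '&') {
                out.append("&amp;");
            } else if (c == quote) {
                // Only the wrapping quote character needs escaping.
                out.append(quote == '"' ? "&quot;" : "&apos;");
            } else {
                out.append(c); // includes '>' and the other quote character
            }
        }
        return out.toString();
    }

    /** Builds name="value" (or name='value') with the value escaped. */
    static String attribute(String name, String value, char quote) {
        return name + "=" + quote + escapeAttribute(value, quote) + quote;
    }

    public static void main(String[] args) {
        String text = "If a<b & b<c then a<c, it's obvious";
        System.out.println("<MyTag " + attribute("attr", text, '"') + "/>");
        // prints: <MyTag attr="If a&lt;b &amp; b&lt;c then a&lt;c, it's obvious"/>
    }
}
```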

izogfif
  • Why does XML require that special characters inside the quotes be escaped in case of attribute values? Only " or ' would need to be quoted... and anything else inside that string could simply be considered as content! – Teddy Apr 16 '16 at 11:13
  • I guess it's a precaution against badly written XML parsers and/or incorrect XML. For example, if quotes for attributes are omitted (``). – izogfif Sep 20 '16 at 14:22
  • Not an expert, but I suspect this is a historical precaution inherited from SGML, which was originally used to define HTML and other markup languages. – LMA1980 Nov 24 '16 at 04:10
  • Even with modern parsers, the closing tag is the problem. Starting tag doesn't give any error. – Sorter Dec 21 '16 at 11:41
  • This is more correct than the accepted answer because it provides the minimal set of necessary escapes. – TToni Apr 27 '18 at 08:38