I’m trying to parse text from a file that comes in a pseudo XML format. I can get a DOM document out of it when it comes in the following structure:

<product>
    <product_id>234567</product_id>
    <description>abc</description>
</product>

The problem I’m running into happens when the structure is similar to the following:

<product>
    <product_id>234567</product_id>
    <description>abc</description>
    <quantity 1:2>
        <version>1.1</version>
    </quantity 1:2>
        <version>1.2</version>
    <quantity 2:2>
    </quantity 2:2>
</product>

It generates the following exception due to the space in <quantity 1:2>:

org.xml.sax.SAXParseException:[Fatal Error] :1:167: Element type " quantity " must be followed by either attribute specifications, ">" or "/>"

I can get around this by replacing the space with an underscore. The problem is that the structure can vary in size and include several child nodes with the same format (<node 1:x>), and the file can contain hundreds of structures to parse. Is there a class available that will parse text like this and return a tree-like object?

BoltClock
Mane

3 Answers

Preprocess the file: change opening tags of the form <element x:y> to <element value="x:y"> and drop the x:y part from the matching closing tags. Then your DOM/SAX parsers will not choke.

I would suggest using a regular expression to help, but that way leads to madness.
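
If you would rather stay away from regular expressions for this step, a plain character scan may be enough. What follows is only a minimal sketch under some assumptions: the class name TagNormalizer and the attribute name value are made up for illustration, tag bodies never contain '>', the label is separated from the element name by a space character (not a tab), and the input has no comments, CDATA sections, or processing instructions.

```java
class TagNormalizer {
    /** Rewrites <name x:y> as <name value="x:y"> and </name x:y> as </name>. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            int open = input.indexOf('<', pos);
            int close = (open < 0) ? -1 : input.indexOf('>', open);
            if (close < 0) {
                // No complete tag left: copy the remainder unchanged.
                out.append(input, pos, input.length());
                break;
            }
            out.append(input, pos, open); // text before the tag
            String body = input.substring(open + 1, close).trim();
            int space = body.indexOf(' ');
            // A pseudo tag has a second token containing ':' and no real attributes.
            boolean pseudo = space > 0
                    && body.indexOf(':', space) > 0
                    && body.indexOf('=') < 0;
            if (!pseudo) {
                out.append(input, open, close + 1); // ordinary tag, keep as is
            } else {
                String name = body.substring(0, space);
                String label = body.substring(space + 1).trim();
                if (name.startsWith("/")) {
                    // End tags cannot carry attributes, so drop the label.
                    out.append('<').append(name).append('>');
                } else {
                    out.append('<').append(name)
                       .append(" value=\"").append(label).append("\">");
                }
            }
            pos = close + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<quantity 1:2><version>1.1</version></quantity 1:2>";
        System.out.println(normalize(sample));
    }
}
```

The output of normalize can then be handed to any DOM or SAX parser. Ordinary tags such as <product_id> pass through untouched because they have no second token.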

Community
Kelly S. French

Your file is not XML at all, and SAX (Simple API for XML) is for XML. You should rethink your structure so that you can do something like:

<quantity myAttr="1.2">
    <version>1.2</version>
</quantity>
<quantity myAttr="1.x">
    <version>1.1</version>
</quantity>
<version>1.0</version>

Or something like that.
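
To illustrate why the attribute form is easier to work with, here is a rough sketch that parses such a structure with the JDK's DOM parser and reads the attribute back. The class name AttributeDemo and the enclosing <product> root are assumptions added for the example (a document needs a single root element); myAttr is taken from the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class AttributeDemo {
    /** Parses a string into a DOM document, wrapping parser errors. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<product>"
                + "<quantity myAttr=\"1.2\"><version>1.2</version></quantity>"
                + "<quantity myAttr=\"1.x\"><version>1.1</version></quantity>"
                + "</product>";
        NodeList quantities = parse(xml).getElementsByTagName("quantity");
        for (int i = 0; i < quantities.getLength(); i++) {
            Element quantity = (Element) quantities.item(i);
            String version = quantity.getElementsByTagName("version")
                    .item(0).getTextContent();
            // Print the attribute value next to the nested version text.
            System.out.println(quantity.getAttribute("myAttr") + " -> " + version);
        }
    }
}
```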

Vicente Plata

"It generates the following exception due to the space in <quantity 1:2>"

The space is not the root cause of the error. The root cause is, as others have already mentioned, that your file format is not valid XML. A valid XML tag would look like <quantity attr1="val1" attr2="val2">.

It sounds like you have no control over the file format. In that case, I think the easiest approach is to preprocess the file into valid XML and then hand it to a DOM/SAX parser:

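import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Read the whole pseudo-XML file into a string (UTF-8 is an assumption).
String xml = new String(Files.readAllBytes(Paths.get("pseudo.pxml")), StandardCharsets.UTF_8);

// Opening tags: <quantity 1:2> becomes <quantity value="1:2">.
// Java uses $1, $2 for backreferences in the replacement string, not \1.
xml = xml.replaceAll("<([\\w.-]+)\\s+([^\\s:>]+:[^\\s:>]+)\\s*>", "<$1 value=\"$2\">");
// Closing tags: </quantity 1:2> becomes </quantity>, because end tags cannot carry attributes.
xml = xml.replaceAll("</([\\w.-]+)\\s+[^\\s:>]+:[^\\s:>]+\\s*>", "</$1>");

InputStream xmlIn = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));

/* use xmlIn for your XML parsers */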

Note that I did not test this code nor is it optimized; just wanted to give you an idea.
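
For completeness, here is one way the preprocessing step might be wired to a DOM parser to get the tree-like object the question asks for. Treat it as a sketch under assumptions: PseudoXmlDemo, toXml, and the attribute name value are hypothetical names, the sample is a shortened variant of the structure in the question, and the patterns only cover tags of the form <name x:y>.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class PseudoXmlDemo {
    /** Turns <name x:y> into <name value="x:y"> and </name x:y> into </name>. */
    static String toXml(String pseudo) {
        return pseudo
                .replaceAll("<([\\w.-]+)\\s+([^\\s:>]+:[^\\s:>]+)\\s*>", "<$1 value=\"$2\">")
                .replaceAll("</([\\w.-]+)\\s+[^\\s:>]+:[^\\s:>]+\\s*>", "</$1>");
    }

    /** Prints element names, indented by their depth in the tree. */
    static void print(Node node, String indent) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            System.out.println(indent + node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            print(child, indent + "  ");
        }
    }

    public static void main(String[] args) throws Exception {
        String pseudo = "<product><product_id>234567</product_id>"
                + "<quantity 1:2><version>1.1</version></quantity 1:2>"
                + "<quantity 2:2><version>1.2</version></quantity 2:2></product>";
        byte[] bytes = toXml(pseudo).getBytes(StandardCharsets.UTF_8);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        print(doc.getDocumentElement(), "");
    }
}
```

The x:y label survives as the value attribute, so it can still be read with getAttribute("value") on each quantity element.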

Alvin