Parsing pseudo XML file in Java

Question

I’m trying to parse text from a file that comes in a pseudo XML format. I can get a DOM document out of it when it comes in the following structure:

<product>
    <product_id>234567</product_id>
    <description>abc</description>
</product>

The problem I’m running into happens when the structure is similar to the following:

<product>
    <product_id>234567</product_id>
    <description>abc</description>
    <quantity 1:2>
        <version>1.1</version>
    </quantity 1:2>
        <version>1.2</version>
    <quantity 2:2>
    </quantity 2:2>
</product>

It generates the following exception due to the space in <quantity 1:2>:

org.xml.sax.SAXParseException:[Fatal Error] :1:167: Element type " quantity " must be followed by either attribute specifications, ">" or "/>"

I can get around this by replacing the space with an underscore. The problem is that the structure can vary in size and include several child nodes with the same format (<node 1:x>), and the file can contain hundreds of structures to parse. Is there a class available that will parse text like this and return a tree-like object?

score 4 · Answer 1 · edited May 23 '17 at 12:03

Preprocess the file and change elements of that x:y form to <element value="x:y">; then your DOM/SAX parsers will not choke.

I would suggest using a regular expression to help but that way leads to madness.
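For what it's worth, the preprocessing step can also be done without a regular expression. Below is a minimal sketch of a hand-written scanner, not a tested library: the class name PseudoXmlNormalizer and its methods are hypothetical, and the attribute name value is simply the one suggested above. It assumes the x:y part contains no quotes, '<' or '&', and that the file has no comments or CDATA sections containing '>'.

```java
final class PseudoXmlNormalizer {

    /** Rewrites <name x:y> to <name value="x:y"> and </name x:y> to </name>. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length() + 32);
        int pos = 0;
        while (pos < input.length()) {
            int open = input.indexOf('<', pos);
            int close = (open < 0) ? -1 : input.indexOf('>', open);
            if (open < 0 || close < 0) {
                out.append(input, pos, input.length()); // no further tags: copy the rest
                break;
            }
            out.append(input, pos, open); // text before the tag is copied unchanged
            out.append(rewriteTag(input.substring(open + 1, close)));
            pos = close + 1;
        }
        return out.toString();
    }

    /** Takes the text between '<' and '>' and returns the complete rewritten tag. */
    private static String rewriteTag(String body) {
        String tag = body.trim();
        int ws = 0;
        while (ws < tag.length() && !Character.isWhitespace(tag.charAt(ws))) {
            ws++;
        }
        boolean special = tag.startsWith("?") || tag.startsWith("!") || tag.endsWith("/");
        if (ws == tag.length() || tag.indexOf('=') >= 0 || special) {
            return "<" + body + ">"; // plain tag, real attributes, or declaration: keep as-is
        }
        String name = tag.substring(0, ws);
        String rest = tag.substring(ws).trim();
        if (name.startsWith("/")) {
            return "<" + name + ">"; // closing tags cannot carry attributes: drop the x:y part
        }
        return "<" + name + " value=\"" + rest + "\">";
    }

    public static void main(String[] args) {
        System.out.println(normalize("<quantity 1:2><version>1.1</version></quantity 1:2>"));
    }
}
```

The output of normalize can then be handed to a regular DOM or SAX parser.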

answered Apr 24 '11 at 17:45

Kelly S. French

score 4 · Answer 2 · answered Apr 24 '11 at 17:49

Your file is not XML at all, and SAX is for XML (Simple API for XML). You should rethink your structure so that you can do something like:

<quantity myAttr="1.2">
    <version>1.2</version>
</quantity>
<quantity myAttr="1.x">
    <version>1.1</version>
</quantity>
<version>1.0</version>

Or something like that.
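Once the data is in that shape, the standard DOM API can read it directly. The following is only a sketch: the class QuantityReader and the method versionsByQuantity are made-up names, myAttr is the placeholder attribute from the snippet above, and a <product> root is assumed because XML needs a single root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class QuantityReader {

    /** Maps each quantity's myAttr value to the text of its first version child. */
    static Map<String, String> versionsByQuantity(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList quantities = doc.getElementsByTagName("quantity");
        for (int i = 0; i < quantities.getLength(); i++) {
            Element quantity = (Element) quantities.item(i);
            NodeList versions = quantity.getElementsByTagName("version");
            // A quantity without a version child maps to the empty string.
            String version = versions.getLength() > 0 ? versions.item(0).getTextContent() : "";
            result.put(quantity.getAttribute("myAttr"), version);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<product>"
                + "<quantity myAttr=\"1.2\"><version>1.2</version></quantity>"
                + "<quantity myAttr=\"1.x\"><version>1.1</version></quantity>"
                + "<version>1.0</version>"
                + "</product>";
        System.out.println(versionsByQuantity(xml));
    }
}
```

The top-level <version>1.0</version> is skipped here because only <quantity> elements are visited.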

score 1 · Accepted Answer · answered Apr 25 '11 at 09:26

It generates the following exception due to the space in <quantity 1:2>

That is not the root cause of the error. The root cause, as others have already mentioned, is that your file format is not valid XML. A valid XML tag would look like <quantity attr1="val1" attr2="val2">.

It sounds like you have no control over the file format. In that case, I think the easiest approach is to preprocess your file into valid XML and then have a DOM/SAX parser parse it:

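import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Read the whole file into a string (assumes the file is UTF-8 encoded).
String xml = new String(Files.readAllBytes(Paths.get("pseudo.pxml")), StandardCharsets.UTF_8);

// Opening tags: <quantity 1:2> becomes <quantity value="1:2">.
// Java uses $1, $2 (not \1) for group references in the replacement.
xml = xml.replaceAll("<(\\w+)\\s+(\\w+:\\w+)\\s*>", "<$1 value=\"$2\">");
// Closing tags cannot carry attributes: </quantity 1:2> becomes </quantity>.
xml = xml.replaceAll("</(\\w+)\\s+\\w+:\\w+\\s*>", "</$1>");

ByteArrayInputStream xmlIn = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));

/* use xmlIn for your XML parsers */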

Note that I did not test this code nor is it optimized; just wanted to give you an idea.
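To make the preprocess-then-parse idea concrete end to end, here is a small self-contained sketch. The class PseudoXmlParser and its methods toXml and parse are hypothetical names, the value attribute is an arbitrary choice, and the patterns assume that tag names and both sides of x:y consist only of word characters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PseudoXmlParser {

    /** Turns <name x:y> into <name value="x:y"> and </name x:y> into </name>. */
    static String toXml(String pseudoXml) {
        return pseudoXml
                .replaceAll("<(\\w+)\\s+(\\w+:\\w+)\\s*>", "<$1 value=\"$2\">")
                .replaceAll("</(\\w+)\\s+\\w+:\\w+\\s*>", "</$1>");
    }

    /** Preprocesses the text and hands the result to the JDK's DOM parser. */
    static Document parse(String pseudoXml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(toXml(pseudoXml))));
    }

    public static void main(String[] args) throws Exception {
        String sample = "<product><product_id>234567</product_id>"
                + "<quantity 1:2><version>1.1</version></quantity 1:2>"
                + "<quantity 2:2></quantity 2:2></product>";
        NodeList quantities = parse(sample).getElementsByTagName("quantity");
        for (int i = 0; i < quantities.getLength(); i++) {
            // Prints the recovered x:y value of each quantity element.
            System.out.println(((Element) quantities.item(i)).getAttribute("value"));
        }
    }
}
```

Real files may need looser patterns, for example if names contain hyphens or the x:y part contains other characters.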

I created a class to pre-process the file into valid XML. – Mane, Mar 07 '12 at 22:59
