0

I am trying to write a regular expression to match an XML document. I am not using an XML parser right away because the file might contain multiple XML documents (well-formed or not), so I would like to remove the ones that are not well-formed before parsing.

xml structure:

<company>
    .....
    <Employees>
    .......
    </Employees>
</company>

Code:

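    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    final String xmlString = "...";
    final List<String> data = new ArrayList<String>();
    // Compile both patterns once, outside the loop.
    final Pattern companyPattern = Pattern.compile("<company>(.+?)</company>", Pattern.DOTALL);
    final Pattern employeesPattern = Pattern.compile("<Employees>(.+?)</Employees>", Pattern.DOTALL);
    final Matcher companyMatcher = companyPattern.matcher(xmlString);
    while (companyMatcher.find())
    {
        // Look for an <Employees> element inside the body of each <company> match.
        final Matcher employeesMatcher = employeesPattern.matcher(companyMatcher.group(1));
        if (employeesMatcher.find())
        {
            data.add(employeesMatcher.group(1));
        }
    }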

This works fine if the XML string contains a single XML document, well-formed or not. But it does not work when a document that is not well-formed is followed by a well-formed one:

<company>
    <Employees>

   </Employees>
<company>
    .....
    <Employees>
    .......
    </Employees>
</company>

In this scenario it returns the whole string rather than just the well-formed XML.

Please help thanks!!

Ikshvak
  • 3
    Uhh... No, you're doing it wrong. **Validate** your XML **before** parsing it. – hd1 Jul 01 '13 at 15:42
  • What is your `readBuilder`? shouldn't it be `xmlString`? – Sazzadur Rahaman Jul 01 '13 at 15:48
  • You definitely **need to validate** that your RegEx is actually correct! See [this awesome tool](http://gskinner.com/RegExr/) to accomplish that. Also keep in mind that in some cases you will need to escape special characters with a \. – Paul Jul 01 '13 at 15:48
  • yes readBuilder is xmlString. Updated. – Ikshvak Jul 01 '13 at 15:51
  • 3
    You should reject the entire input if it's not well-formed. [Trying to validate/parse XML (or HTML) with regex will fail](http://stackoverflow.com/a/1732454/222364), especially with malformed XML. – Darth Android Jul 01 '13 at 15:53
  • Please see this answer: http://stackoverflow.com/a/1732454/71034 (yes, it applies to XML too.) – Dan Breslau Jul 01 '13 at 15:55
  • That is what I am trying to achieve: if the XML is not well-formed, I don't want to add it to my list. – Ikshvak Jul 01 '13 at 15:56
  • Why aren't you using XML Parsing API? – Makky Jul 01 '13 at 15:58
  • I am using one, but the problem is if the string contains more than one xml string. The input from here goes into other existing code, so I can't modify that code, all I need is to construct a list of string with valid xml format which have only the substring Employees – Ikshvak Jul 01 '13 at 16:03

2 Answers

2

Doing this with a single regular expression is never going to work.

Assuming that the start and end tags appear on separate lines, you need to process the XML one line at a time, keeping track of what you have seen and buffering input until you identify a complete valid subdocument.

Pseudocode:

buffer = ""
while (line = read_input)
{
    tag = trim(line) // assumes each <company> tag sits alone on its line
    if tag=="<company>"
    {
        buffer = line // discard whatever we have seen since it didn't end with </company>
    }
    else if tag=="</company>"
    {
        buffer += line
        write buffer
        buffer = ""
    }
    else
        buffer += line
}

This is the general idea of how to approach the problem... the specifics could be improved (left as an exercise).
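
A minimal Java sketch of the pseudocode above, under the same assumption that `<company>` and `</company>` each appear alone on their own line. The names `CompanyExtractor`, `extract` and `isWellFormed` are hypothetical, chosen only for illustration. The sketch also adds one step that is not in the pseudocode: each buffered candidate is handed to the JDK's `DocumentBuilder` and kept only if it parses, which is one possible way to reject blocks that are still malformed inside.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; not part of any library.
final class CompanyExtractor {

    /** Returns every <company>...</company> block that the XML parser accepts. */
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder buffer = null; // null means we are not inside a <company> block
        for (String line : input.split("\\R")) {
            String tag = line.trim();
            if (tag.equals("<company>")) {
                // A new start tag: drop anything buffered, since it never closed.
                buffer = new StringBuilder(line).append('\n');
            } else if (buffer != null) {
                buffer.append(line).append('\n');
                if (tag.equals("</company>")) {
                    String candidate = buffer.toString();
                    if (isWellFormed(candidate)) {
                        result.add(candidate);
                    }
                    buffer = null;
                }
            }
            // Lines outside any <company> block are ignored.
        }
        return result;
    }

    /** Returns true if the string parses as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }
}
```

Calling `CompanyExtractor.extract(xmlString)` on the sample input from the question would discard the first, unclosed `<company>` block when the second `<company>` line is seen, and return only the complete block. Extracting the `<Employees>` content from each returned block can then be done with a normal XML parser.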

Jim Garrison
  • I am also using the same pattern after listening to the responses here. – Ikshvak Jul 01 '13 at 17:08
  • Looping over the string until I find the end tag, then going back through the string until I find the start tag, and repeating until I have found all the data. – Ikshvak Jul 01 '13 at 17:08
0

You're parsing a language that is similar to XML, but not quite the same.

So the first thing you need to do is to specify the grammar of that language: what constructs is your parser going to accept?

Then you need to write your parser. Almost certainly, the grammar of your language will be recursive, which means it will be beyond the capability of regular expressions to parse it. You may be able to write a parser using tools such as JavaCC.
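
To illustrate why nesting defeats a regular expression, here is a small sketch using a hypothetical `<a>` element that is allowed to contain itself (the class and method names are also made up for illustration). A lazy pattern stops at the first closing tag it meets, and a greedy one runs on to the last closing tag in the input, so neither pairs tags by depth.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; <a> stands in for any element that can nest.
final class NestingDemo {

    private static final Pattern LAZY = Pattern.compile("<a>(.*?)</a>", Pattern.DOTALL);
    private static final Pattern GREEDY = Pattern.compile("<a>(.*)</a>", Pattern.DOTALL);

    /** Returns the first lazy match, or null if there is none. */
    static String firstLazyMatch(String input) {
        Matcher m = LAZY.matcher(input);
        return m.find() ? m.group() : null;
    }

    /** Returns the first greedy match, or null if there is none. */
    static String firstGreedyMatch(String input) {
        Matcher m = GREEDY.matcher(input);
        return m.find() ? m.group() : null;
    }
}
```

For the nested input `<a><a>x</a></a>`, `firstLazyMatch` returns `<a><a>x</a>`, which cuts the outer element short. For the sibling input `<a>1</a><a>2</a>`, `firstGreedyMatch` returns the whole string, which merges two separate elements. The same lazy behavior explains the result in the question: the match starts at the first `<company>` and extends to the first `</company>`, swallowing the second start tag on the way.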

But you need to do some reading. If you're attempting to do this job using regular expressions, this suggests that you aren't aware of the basic computer science behind the problem you are tackling. If you're a smart hacker, you may be able to knock something up that works on most of your input documents, but it will always be at risk of falling over on the next one, unless you understand the theory and apply it.

Michael Kay