3

Given the below XML snippet I need to get a list of name/value pairs for each child under DataElements. XPath or an XML parser cannot be used for reasons beyond my control so I am using regex.

<?xml version="1.0"?>
<StandardDataObject xmlns="myns">
  <DataElements>
    <EmpStatus>2.0</EmpStatus>
    <Expenditure>95465.00</Expenditure>
    <StaffType>11.A</StaffType>
    <Industry>13</Industry>
  </DataElements>
  <InteractionElements>
    <TargetCenter>92f4-MPA</TargetCenter>
    <Trace>7.19879</Trace>
  </InteractionElements>
</StandardDataObject>

The output I need is: [{EmpStatus:2.0}, {Expenditure:95465.00}, {StaffType:11.A}, {Industry:13}]

The tag names under DataElements are dynamic and so cannot be expressed literally in the regex. The tag names TargetCenter and Trace are static and could be in the regex but if there is a way to avoid hardcoding that would be preferable.

"<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</"

This is the regex I have constructed, but it erroneously includes {Trace:7.19879} in the results. Relying on newlines within the XML or any other apparent formatting is not an option.

Below is an approximation of the Java code I am using:

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private List<DataElement> listDataElements(CharSequence cs) {
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

How can I change my regex to only include data elements and ignore the rest?

Steve McLeod
Mocky

8 Answers

51

XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags; when you fix that, it will break on XML comments, then CDATA sections, then processing instructions, then namespaces, ... It cannot work; use an XML parser.
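
To make the failure modes concrete, here is a minimal sketch that runs the pattern from the question against a small input containing a comment and a CDATA section. The input string and the class and method names (`RegexBreakageDemo`, `findPairs`) are made up for illustration; only the regex comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBreakageDemo {
    // The pattern from the question, unchanged.
    static final Pattern QUESTION_PATTERN =
            Pattern.compile("<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</");

    // Returns every "name:value" pair the pattern finds, in document order.
    static List<String> findPairs(CharSequence xml) {
        List<String> pairs = new ArrayList<>();
        Matcher m = QUESTION_PATTERN.matcher(xml);
        while (m.find()) {
            pairs.add(m.group(1) + ":" + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical input: one element is commented out, one uses CDATA.
        String xml = "<DataElements>"
                + "<!-- <Old>1</Old> -->"
                + "<Note><![CDATA[42]]></Note>"
                + "<Industry>13</Industry>"
                + "</DataElements>";
        // prints [Old:1, Industry:13]
        System.out.println(findPairs(xml));
    }
}
```

The commented-out `Old` element is reported as if it were live data, while the `Note` element, whose value sits in a CDATA section, is silently skipped.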

Dour High Arch
  • 3
    I suspect that you are giving wrong information to assert that regex cannot be used for lightweight parsing of a simplistic and reliable subset xml. – Mocky Dec 02 '08 at 21:29
  • 11
    The bottom line is that "simplistic and reliable" xml will, at some point, change. When it changes, your lightweight parser will fail, and you will be back where you are now. You will try to fix your parser, but it will quickly become an unreadable rat's nest. – James Van Huis Dec 02 '08 at 21:44
  • 21
    No, I am giving correct information that regular grammars cannot express context-free grammars, it is mathematically impossible. Please read http://en.wikipedia.org/wiki/Chomsky_hierarchy. – Dour High Arch Dec 02 '08 at 22:49
  • 14
    Using regexes to parse XML always ends in tears. – James Sulak Dec 02 '08 at 22:51
  • 14
    Let's all agree that this is a bad idea, it will end in tears and become a rat's nest. But to give perspective to those who may see this answer later and think it true: parsing (for example) PNG files with an XML parser is impossible, parsing some XML with regex is merely ill advised. – Mocky Dec 03 '08 at 14:30
  • @DourHighArch then how do xml parsers parse xml without using regex? – SRN Aug 16 '12 at 01:03
  • 1
    @SRN regex is not the only pattern matching method – Joshua May 17 '13 at 15:46
  • @Joshua what pattern matching methods do XML parsers use? – zundi Aug 21 '15 at 16:05
  • For optimisation purposes in a critical part of a codebase I went from using an XML parser to using a regular expression, which ended up being over 3 times faster. The regex, being only about 100 characters, is easy to understand. So this advice "It cannot work" is a bit extreme. It works and if you need speed, a simple regex might be a good solution. – laurent Jan 30 '18 at 08:40
  • I suspect those who advise against using a regex had to maintain horrible multi-thousand-line codebases with regex hacks all over the place to get the parsing working. I agree in this case a parser is better. I'd say use a regex only in very well understood cases, once you have benchmarks to show that there will be a clear speed improvement (and this improvement is needed) without completely sacrificing maintainability. – laurent Jan 30 '18 at 08:44
16

This should work in Java, if you can assume that between the DataElements tags, everything has the form <tag>value</tag>, i.e. no attributes and no nested elements.

// Stage 1: capture everything between the DataElements tags.
Pattern regex = Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
Matcher matcher = regex.matcher(subjectString);
// Stage 2: match <tag>value</tag>; \\1 requires the closing tag to repeat the opening name.
Pattern regex2 = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");
if (matcher.find()) {
    String dataElements = matcher.group(1);
    Matcher matcher2 = regex2.matcher(dataElements);
    while (matcher2.find()) {
        list.add(new DataElement(matcher2.group(1), matcher2.group(2)));
    }
}
Jan Goyvaerts
4

Use XPath instead!

activout.se
2

You really should be using an XML library for this.

If you have to use regex, why not do it in two stages? First match <DataElements>.*?</DataElements>, then apply what you have now to that match.
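
A rough sketch of that two-stage idea, assuming the asker's pattern is kept as-is for the second stage. The class name `TwoStageDemo`, the `name:value` string output, and the use of `Matcher.region` to confine the second pass are my own choices for illustration, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageDemo {
    // Stage 1: locate the body of the DataElements element.
    static final Pattern SECTION =
            Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
    // Stage 2: the pattern from the question, unchanged.
    static final Pattern PAIR =
            Pattern.compile("<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</");

    static List<String> listDataElements(CharSequence xml) {
        List<String> pairs = new ArrayList<>();
        Matcher section = SECTION.matcher(xml);
        if (!section.find()) {
            return pairs; // no DataElements block: nothing to report
        }
        // Confine the second matcher to the span captured by group 1.
        Matcher pair = PAIR.matcher(xml).region(section.start(1), section.end(1));
        while (pair.find()) {
            pairs.add(pair.group(1) + ":" + pair.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<StandardDataObject xmlns=\"myns\">\n"
                + "  <DataElements>\n"
                + "    <EmpStatus>2.0</EmpStatus>\n"
                + "    <Expenditure>95465.00</Expenditure>\n"
                + "    <StaffType>11.A</StaffType>\n"
                + "    <Industry>13</Industry>\n"
                + "  </DataElements>\n"
                + "  <InteractionElements>\n"
                + "    <TargetCenter>92f4-MPA</TargetCenter>\n"
                + "    <Trace>7.19879</Trace>\n"
                + "  </InteractionElements>\n"
                + "</StandardDataObject>";
        // prints [EmpStatus:2.0, Expenditure:95465.00, StaffType:11.A, Industry:13]
        System.out.println(listDataElements(xml));
    }
}
```

Using a region avoids copying the substring and keeps the second pass from ever seeing Trace or TargetCenter, so neither name needs to be hardcoded.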

Greg
1

Is there any reason you're not using a proper XML parser instead of regexes? This would be trivial with the right library.

Alnitak
  • My suspicion is that this is trivial no matter what approach you take and I am unable to use an XML parser in this situation. – Mocky Dec 02 '08 at 20:43
1

Sorry to give you yet another "Don't use regex" answer, but seriously. Please use Commons-Digester, JAXP (bundled with Java 5+) or JAXB (bundled with Java 6+) as it will save you from a boatload of hurt.

Guðmundur Bjarni
1

You should listen to everyone. A lightweight parser is a bad idea.

However, if you are really that hard headed about it, you should be able to tweak your code to exclude the tags outside of the DataElements tag.

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private static final String START_TAG = "<DataElements>";
private static final String END_TAG = "</DataElements>";
private List<DataElement> listDataElements(String input) {
    String cs = input.substring(input.indexOf(START_TAG) + START_TAG.length(), input.indexOf(END_TAG));
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

This will fail horribly if the DataElements tag does not exist.
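
The failure comes from `indexOf` returning -1 for a missing tag, after which `substring` either throws or slices the wrong span. A guarded variant might look like the sketch below; the class and method names (`DataElementsSlice`, `sliceDataElements`) and the choice to return an empty string are assumptions for illustration.

```java
class DataElementsSlice {
    static final String START_TAG = "<DataElements>";
    static final String END_TAG = "</DataElements>";

    // Returns the text between the tags, or "" when either tag is missing.
    static String sliceDataElements(String input) {
        int start = input.indexOf(START_TAG);
        if (start < 0) {
            return "";
        }
        int bodyStart = start + START_TAG.length();
        // Search for the end tag only after the start tag.
        int end = input.indexOf(END_TAG, bodyStart);
        if (end < 0) {
            return "";
        }
        return input.substring(bodyStart, end);
    }

    public static void main(String[] args) {
        // prints <X>1</X>
        System.out.println(sliceDataElements("<a><DataElements><X>1</X></DataElements></a>"));
        // prints an empty line: no DataElements tag present
        System.out.println(sliceDataElements("<a><Trace>7.19879</Trace></a>"));
    }
}
```

The returned slice can then be handed to the existing matcher loop in place of `cs`.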

Once again, this is a bad idea, and you will likely be revisiting this piece of code some time in the future in the form of a bug report.

James Van Huis
  • Thank you for taking the time to put this together. But the Java String manipulation route is an entirely different approach. – Mocky Dec 03 '08 at 18:13
0

Try loading the regex from a property file and then creating the Pattern object from it. I resolved the same issue when injecting a regex via XML bean definitions.

For example, I needed to inject the regex '(.)(D[0-9]{7}.D[0-9]{9}.D[A-Z]{3}[0-9]{4})(.)' through Spring, but it didn't work. When I hard-coded the same regex in a Java class, it worked.

Pattern pattern = Pattern.compile("(.)(D[0-9]{7}.D[0-9]{9}.D[A-Z]{2}[0-9]{4})(.)");
Matcher matcher = pattern.matcher(file.getName().trim());

Next I tried loading that regex from a property file while injecting it. It worked fine.

  p:remoteDirectory="${rawDailyReport.remote.download.dir}"
  p:localDirectory="${rawDailyReport.local.valid.dir}"
  p:redEx="${rawDailyReport.download.regex}"

And in the property file the property is defined as follows.

rawDailyReport.download.regex=(.)(D[0-9]{7}\.D[0-9]{9}\.D[A-Z]{2}[0-9]{4})(.)

This is because values with placeholders are loaded through org.springframework.beans.factory.config.PropertyPlaceholderConfigurer, which handles these XML-sensitive characters internally.

Thanks, Amith