Regex a xml string

Question

What would be the correct way to find a string like this in a large xml:

<ser:serviceItemValues>
    <ord1:label>Start Type</ord1:label>
    <ord1:value>Loop</ord1:value>
    <ord1:valueCd/>
    <ord1:activityCd>iactn</ord1:activityCd>
 </ser:serviceItemValues>

First, in this XML there will be many repeats of the element above with different values (Loop, etc.), along with other XML elements in the document. What I am mainly concerned with is whether there is a serviceItemValues that does not have 'Loop' as its value. I tried this, but it doesn't seem to work:

private static Pattern LOOP_REGEX =
        Pattern.compile("[\\p{Print}]*?<ord1:label>Start Type</ord1:label>[\\p{Print}]+[^(Loop)][\\p{Print}]+</ser:serviceItemValues>[\\p{Print}]*?", Pattern.CASE_INSENSITIVE|Pattern.MULTILINE);

Thanks

Thanks for all the comments. Let me clarify a bit for those saying not to use regex: I don't care what the value is and I am not trying to extract it. I just want to be sure it says Loop; if it doesn't, I will throw an exception. So I guess it is validation, but I cannot modify the XSD. — arinte, Aug 26 '09 at 19:57
I believe everyone understands what you're trying to do. However, regular expressions are not the best solution. Markup is best left to parsers. — doomspork, Aug 26 '09 at 20:08

doomspork · Answer 1 · 2009-08-26T20:07:28.420

Regular expressions are not the best option when parsing large amounts of HTML or XML.

There are a number of ways you could handle this without relying on regular expressions. Depending on the libraries you have at your disposal, you may be able to find the elements you're looking for by using XPath.

Here's a tutorial that may help you on your way: http://www.totheriver.com/learn/xml/xmltutorial.html

score 3 · Answer 2 · edited Nov 25 '09 at 14:02

A regular expression is not the right tool for this job. You should be using an XML parser. It's pretty simple to set up and use, and will probably take you less time to code than it would to come up with this regular expression.

I recommend using JDOM. It has an easy syntax. An example can be found here: http://notetodogself.blogspot.com/2008/04/teamsite-dcr-java-parser.html

If the documents that you will be parsing are large, you should use a SAX parser. I recommend Xerces.
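
As a rough sketch of the SAX approach, the handler below streams through the document and collects the value of every Start Type item that is not Loop. The names SaxLoopCheck and nonLoopStartTypes are made up for illustration, and the urn:example namespace URIs in main are placeholders since the question does not show the real ones. Elements are matched by local name only, and the sketch uses the SAX parser bundled with the JDK rather than a separately installed Xerces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxLoopCheck {
    // Returns the value of every "Start Type" item whose value is not "Loop".
    static List<String> nonLoopStartTypes(String xml) {
        List<String> offenders = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String label;
            private String value;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0);
                if (localName.equals("serviceItemValues")) {
                    label = null; // new item: forget the previous label and value
                    value = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                switch (localName) {
                    case "label":
                        label = text.toString().trim();
                        break;
                    case "value":
                        value = text.toString().trim();
                        break;
                    case "serviceItemValues":
                        if ("Start Type".equals(label) && !"Loop".equals(value)) {
                            offenders.add(value);
                        }
                        break;
                    default:
                        break;
                }
                text.setLength(0);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap parser configuration, I/O and SAX errors in one unchecked exception.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return offenders;
    }

    public static void main(String[] args) {
        // Placeholder namespace URIs, assumed for this example only.
        String xml = "<root xmlns:ser='urn:example:ser' xmlns:ord1='urn:example:ord1'>"
                + "<ser:serviceItemValues><ord1:label>Start Type</ord1:label>"
                + "<ord1:value>Ground</ord1:value></ser:serviceItemValues></root>";
        System.out.println(nonLoopStartTypes(xml));
    }
}
```

An empty result means every Start Type item says Loop; anything else is a candidate for the exception the asker wants to throw.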

score 3 · Accepted Answer · answered Aug 26 '09 at 20:29

Look up XPath, which is kinda like regex for XML. Sort of.

With XPath you write expressions that extract information from XML documents, so extracting the nodes which don't have Loop as a sub-node is exactly the sort of thing it's cut out for.

I haven't tried this, but as a first stab, I'd guess the XPath expression would look something like:

"//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*"

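For what it's worth, here is a minimal sketch of running that expression from Java with the javax.xml.xpath API in the JDK. The class and method names are invented for the example, and the two namespace URIs are placeholders because the question never shows the real ones; they would need to match whatever the actual document declares for ser and ord1. Note that text()!='Loop' does not match an empty ord1:value element, since there is no text node to compare.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathLoopCheck {
    // Hypothetical namespace URIs; replace with the ones the real document declares.
    static final String SER_NS = "urn:example:ser";
    static final String ORD1_NS = "urn:example:ord1";

    static final String NOT_LOOP =
            "//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*";

    // Returns the ord1:value text of every serviceItemValues whose value is not "Loop".
    static List<String> nonLoopValues(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // prefixes in the XPath only resolve on a namespace-aware DOM
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    if ("ser".equals(prefix)) return SER_NS;
                    if ("ord1".equals(prefix)) return ORD1_NS;
                    return XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return null; // reverse lookup is not needed for evaluation
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return null;
                }
            });

            NodeList hits = (NodeList) xpath.evaluate(NOT_LOOP, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element item = (Element) hits.item(i);
                values.add(item.getElementsByTagNameNS(ORD1_NS, "value").item(0).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Wrap parser and XPath errors in one unchecked exception.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root xmlns:ser='" + SER_NS + "' xmlns:ord1='" + ORD1_NS + "'>"
                + "<ser:serviceItemValues><ord1:label>Start Type</ord1:label>"
                + "<ord1:value>Ground</ord1:value></ser:serviceItemValues></root>";
        System.out.println(nonLoopValues(xml));
    }
}
```

The expression as written flags any serviceItemValues with a non-Loop value, whatever its label. If only Start Type items matter, a predicate such as [ord1:label='Start Type'] on ser:serviceItemValues would narrow it down.
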
izb

Stop upvoting this, you all know this is the wrong way to approach the problem :( – Esko Nov 25 '09 at 14:05

Why is this wrong? This is exactly what xpath is for, isn't it? – izb Nov 26 '09 at 09:50

score 1 · Answer 4 · answered Aug 26 '09 at 19:52

When dealing with XML, you should probably not use regular expressions to check the content. Instead, use either a SAX parsing based routine to check relevant contents or a DOM-like model (preferably pull-based if you're dealing with large documents).
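
A pull-based check could look roughly like the following, using the StAX API (javax.xml.stream) that ships with the JDK. This is only a sketch: StaxLoopCheck and allStartTypesAreLoop are invented names, the urn:example namespace URIs in main are placeholders, elements are matched by local name, and it assumes the label element comes before the value element inside each item, as in the sample from the question.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxLoopCheck {
    // Returns false as soon as a "Start Type" item with a value other than "Loop" is seen.
    static boolean allStartTypesAreLoop(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String label = null;
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("serviceItemValues")) {
                    label = null; // new item: forget the previous label
                } else if (name.equals("label")) {
                    label = reader.getElementText().trim();
                } else if (name.equals("value") && "Start Type".equals(label)) {
                    if (!"Loop".equals(reader.getElementText().trim())) {
                        return false; // stop early, the rest of the document is not read
                    }
                }
            }
            return true;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder namespace URIs, assumed for this example only.
        String xml = "<root xmlns:ser='urn:example:ser' xmlns:ord1='urn:example:ord1'>"
                + "<ser:serviceItemValues><ord1:label>Start Type</ord1:label>"
                + "<ord1:value>Ground</ord1:value></ser:serviceItemValues></root>";
        if (!allStartTypesAreLoop(xml)) {
            System.out.println("Found a Start Type that is not Loop");
        }
    }
}
```

Because the reader stops at the first offending item, this suits the "throw an exception if it isn't Loop" case without holding the whole document in memory.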

Of course, if you're trying to validate the document's contents somehow, you should probably use some schema tool (I'd go with RELAX NG or Schematron, but I guess you could use XML Schema).

score 1 · Answer 5 · edited Nov 25 '09 at 13:57

As mentioned in the other answers, regular expressions are not the tool for the job. You need an XPath engine. If you want to do these things from the command line, though, I recommend installing XMLStarlet. I have had very good experience with this tool when solving various XML-related tasks. Depending on your OS, you might be able to just install the xmlstarlet RPM or deb package. I think MacPorts for Mac OS X includes the package as well.

Hardy · answered Aug 27 '09 at 06:37 · edited by Peter Mortensen, Nov 25 '09 at 13:57

Oops, you wanted to do it in Java. Well, XMLStarlet is still a cool tool. – Hardy Aug 27 '09 at 06:39
