0

What would be the correct way to find a string like this in a large xml:

<ser:serviceItemValues>
    <ord1:label>Start Type</ord1:label>
    <ord1:value>Loop</ord1:value>
    <ord1:valueCd/>
    <ord1:activityCd>iactn</ord1:activityCd>
</ser:serviceItemValues>

First, this XML will contain many repeats of the element above with different values (Loop, etc.), along with other XML elements. What I am mainly concerned with is whether there is a serviceItemValues that does not have 'Loop' as its value. I tried this, but it doesn't seem to work:

private static Pattern LOOP_REGEX =
        Pattern.compile("[\\p{Print}]*?<ord1:label>Start Type</ord1:label>[\\p{Print}]+[^(Loop)][\\p{Print}]+</ser:serviceItemValues>[\\p{Print}]*?", Pattern.CASE_INSENSITIVE|Pattern.MULTILINE);

Thanks

arinte
  • Thanks for all the comments. Let me clarify a bit for those saying not to use regex: I don't care what the value is and I'm not trying to extract it. I just want to be sure it says Loop; if it doesn't, I will throw an exception. So I guess it is validation, but I cannot modify the XSD. – arinte Aug 26 '09 at 19:57
  • I believe everyone understands what you're trying to do. However, regular expressions are not the best solution. Markup is best left to parsers. – doomspork Aug 26 '09 at 20:08

5 Answers

4

Regular expressions are not the best option when parsing large amounts of HTML or XML.

There are a number of ways you could handle this without relying on regular expressions. Depending on the libraries you have at your disposal, you may be able to find the elements you're looking for by using XPath.

Here's a tutorial that may help you on your way: http://www.totheriver.com/learn/xml/xmltutorial.html

doomspork
3

A regular expression is not the right tool for this job. You should be using an XML parser. It's pretty simple to set up and use, and it will probably take you less time to code than it would to come up with this regular expression.

I recommend using JDOM. It has an easy syntax. An example can be found here: http://notetodogself.blogspot.com/2008/04/teamsite-dcr-java-parser.html

If the documents that you will be parsing are large, you should use a SAX parser. I recommend Xerces.
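
As a rough illustration of the SAX approach for the check described in the question, here is a minimal sketch using the SAX API bundled with the JDK. It assumes elements can be matched by local name alone (the namespace URIs behind ser and ord1 are not shown in the question), and the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams the document and counts "Start Type" items whose value is not "Loop".
class StartTypeSaxCheck {

    static class Handler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private String label;
        private String value;
        int violations = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
            if ("serviceItemValues".equals(localName)) {
                // A new item starts: forget what the previous one contained.
                label = null;
                value = null;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per text node, so accumulate.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("label".equals(localName)) {
                label = text.toString().trim();
            } else if ("value".equals(localName)) {
                value = text.toString().trim();
            } else if ("serviceItemValues".equals(localName)) {
                if ("Start Type".equals(label) && !"Loop".equals(value)) {
                    violations++;
                }
            }
            text.setLength(0);
        }
    }

    static int countNonLoopStartTypes(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        Handler handler = new Handler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.violations;
    }
}
```

Because the handler only keeps the current label and value, memory use stays flat regardless of document size. For a real file you would pass a File or InputStream to parse instead of a String.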

Peter Mortensen
mkoryak
3

Look up XPath, which is kinda like regex for XML. Sort of.

With XPath you write expressions that extract information from XML documents, so extracting the nodes which don't have Loop as a sub-node is exactly the sort of thing it's cut out for.

I haven't tried this, but as a first stab, I'd guess the XPath expression would look something like:

"//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*"
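
For what it's worth, here is a minimal sketch of evaluating such an expression with the javax.xml.xpath API from the JDK. The namespace URIs are placeholders (the question doesn't show the real ones), and the class and method names are hypothetical. The expression is a variant of the one above: it also filters on the 'Start Type' label and uses not(...='Loop'), so an item with an empty or missing value is flagged too, which a text()!='Loop' test would not catch.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: counts "Start Type" items whose value is not "Loop".
class StartTypeXPathCheck {

    // Placeholder URIs: replace with the ones actually declared in the document.
    static final String SER_NS = "urn:example:service";
    static final String ORD1_NS = "urn:example:order";

    static int countNonLoopStartTypes(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for prefixed names in XPath
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                if ("ser".equals(prefix)) return SER_NS;
                if ("ord1".equals(prefix)) return ORD1_NS;
                return XMLConstants.NULL_NS_URI;
            }
            // Reverse lookups are not needed for evaluating an expression.
            @Override
            public String getPrefix(String uri) { return null; }
            @Override
            public Iterator<String> getPrefixes(String uri) { return null; }
        });

        NodeList offenders = (NodeList) xpath.evaluate(
                "//ser:serviceItemValues[ord1:label='Start Type'][not(ord1:value='Loop')]",
                doc, XPathConstants.NODESET);
        return offenders.getLength();
    }
}
```

This loads the whole document into a DOM, so it trades memory for convenience; a non-zero result means at least one item fails the check.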
izb
1

When dealing with XML, you should probably not use regular expressions to check the content. Instead, use either a SAX parsing based routine to check relevant contents or a DOM-like model (preferably pull-based if you're dealing with large documents).
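
A pull-based version of the check might look roughly like the sketch below, which uses the StAX API (javax.xml.stream) from the JDK and throws as soon as it sees an offending item, as the asker described wanting. It matches elements by local name only and assumes label and value contain plain text; the class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls events one at a time and fails fast on a non-Loop "Start Type".
class StartTypePullCheck {

    static void requireLoop(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        boolean inItem = false;
        String label = null;
        String value = null;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("serviceItemValues".equals(name)) {
                        inItem = true;
                        label = null;
                        value = null;
                    } else if (inItem && "label".equals(name)) {
                        // getElementText() consumes up to the matching end tag.
                        label = reader.getElementText().trim();
                    } else if (inItem && "value".equals(name)) {
                        value = reader.getElementText().trim();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "serviceItemValues".equals(reader.getLocalName())) {
                    inItem = false;
                    if ("Start Type".equals(label) && !"Loop".equals(value)) {
                        throw new IllegalStateException(
                                "Start Type is '" + value + "', expected 'Loop'");
                    }
                }
            }
        } finally {
            reader.close();
        }
    }
}
```

Like SAX, this never holds more than the current item in memory, but the control flow stays in ordinary loop code rather than in callbacks.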

Of course, if you're trying to validate the document's contents somehow, you should probably use some schema tool (I'd go with RELAX NG or Schematron, but I guess you could use XML Schema).

djc
1

As mentioned in the other answers, regular expressions are not the tool for the job. You need an XPath engine. If you want to do these things from the command line, though, I recommend installing XMLStarlet. I have had very good experiences with this tool for solving various XML-related tasks. Depending on your OS, you might be able to just install the xmlstarlet RPM or deb package. I think the Mac OS X ports collection includes the package as well.

Peter Mortensen
Hardy