<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE ... ]> 
<abc-config version="THIS" id="abc">
...
</abc-config>

Hi all,

In the XML above, how can I extract the value of the version attribute of abc-config using a regex in Groovy/Java?

Thanks.

minirasher
  • There's something you should know... I don't know how to say this, but... be prepared for 10,000 lectures. Oh, and welcome to Stackoverflow. – harpo Feb 07 '11 at 23:04
  • If by 'regex' you mean 'XPath', then you've come to the right place. – Paul Ruane Feb 07 '11 at 23:10
  • I meant regex, not XPath – minirasher Feb 07 '11 at 23:11
  • Thanks, but using regex is a requirement. I do not want to use Xpath. – minirasher Feb 07 '11 at 23:13
  • one way I can think of is to split the string at version=" and then again at " id=", but this seems sloppy and I am wondering if there is a better regex? – minirasher Feb 07 '11 at 23:16
  • Unless this is a college assignment aimed at teaching regex, why is using a regex a requirement? Surely the real requirement is to get the data of interest out of the XML in an elegant fashion? Regex will not help you achieve this goal. – Dónal Feb 08 '11 at 09:13

3 Answers


A regex to handle this could be something like:

/<\?xml version="([0-9.]+)"/

I'll spare you one of the 10000 lectures about not using a regex to parse markup languages.

Edit: The One whose Name cannot be expressed in the Basic Multilingual Plane, He compelled me.
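
For what it's worth, the pattern above captures the version in the <?xml ...?> declaration ("1.0" in the question's sample), not the version attribute on <abc-config>. Below is a minimal Java sketch of both, assuming the element is always named abc-config and the attribute value is double-quoted. The class and method names (VersionRegex, declarationVersion, elementVersion) and the second pattern are made up for illustration and are not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VersionRegex {
    // The pattern from the answer: captures the version in the XML declaration.
    private static final Pattern DECLARATION =
            Pattern.compile("<\\?xml version=\"([0-9.]+)\"");

    // Assumed variant (not from the answer): anchors on the abc-config start tag
    // and captures its double-quoted version attribute, wherever it sits in the tag.
    private static final Pattern ELEMENT =
            Pattern.compile("<abc-config\\b[^>]*?\\sversion=\"([^\"]*)\"");

    static String declarationVersion(String xml) {
        return firstGroup(DECLARATION, xml);
    }

    static String elementVersion(String xml) {
        return firstGroup(ELEMENT, xml);
    }

    // Returns capture group 1 of the first match, or null if nothing matches.
    private static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<abc-config version=\"THIS\" id=\"abc\">\n"
                + "</abc-config>";
        System.out.println(declarationVersion(xml)); // 1.0
        System.out.println(elementVersion(xml));     // THIS
    }
}
```

The same pattern strings should work from Groovy via java.util.regex. The sketch will still break on single-quoted values, comments, or CDATA that happen to contain a lookalike tag, which is the usual argument for a real parser.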

CanSpice
  • I have read on many occasions that it is not good to use regex to parse HTML or XML, but I am compelled to do so: XmlParser, XmlSlurper, DOM, SAX, nothing seems to parse my XML file, which has a DOCTYPE declaration. Can you suggest a way around this? – minirasher Feb 07 '11 at 23:50
  • Post a question asking about that specific problem. Loads of properly-formed documents have DOCTYPE declarations and can be parsed. – CanSpice Feb 08 '11 at 00:10

I know you asked for a regex, but what's wrong with this in Groovy?

Assuming the xml is something like:

def xml= '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE abc-config>
<abc-config version="THIS" id="abc">
  <node></node>
</abc-config>'''

Then I can parse it with:

def n = new XmlSlurper().parseText( xml )

And then this line:

println n.@version

prints out "THIS".


If you are having problems with a more complex DOCTYPE failing to load, you can try disabling the DOCTYPE checker by either:

def parser = new XmlSlurper()
parser.setFeature( "http://apache.org/xml/features/nonvalidating/load-external-dtd", false )
parser.setFeature( "http://xml.org/sax/features/namespaces", false )
parser.parseText( xml )

or by using the two-argument XmlSlurper constructor, new XmlSlurper(false, false) (validating and namespaceAware both set to false), to disable this checking.
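
If you are on plain Java rather than Groovy, roughly the same idea can be sketched with the JAXP DOM parser. This is only a sketch under assumptions: the feature URI is the one used above and is Xerces-specific, so it relies on the parser in use (for example the one bundled with the JDK) recognising it. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DoctypeTolerantReader {
    // Parses the XML without validating or fetching an external DTD and
    // returns the named attribute of the root element ("" if it is absent).
    static String rootAttribute(String xml, String attributeName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            // Xerces-specific feature; assumed to be supported by the parser in use.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(attributeName);
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get a single failure type.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<!DOCTYPE abc-config>\n"
                + "<abc-config version=\"THIS\" id=\"abc\">\n"
                + "  <node></node>\n"
                + "</abc-config>";
        System.out.println(rootAttribute(xml, "version")); // THIS
    }
}
```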

tim_yates

This is a Perl regex, not a Java one:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/sg

Note that this fails on many levels. I could fill the page with a proper regex, but I don't have the desire.

This fails too:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/sg

So does this:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*(["'])(.+?)\1[^>]*?\s*\/?>/sg
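
For anyone trying these from Java, here is a rough port of the last pattern, offered as a sketch only. Two things are assumptions rather than part of the answer: the class and method names, and the widening of the element-name part from \w+ to [\w:.-]+. The widening is there because \w+ stops at the hyphen in abc-config, so the pattern as written finds nothing in the question's sample. Perl's /s flag corresponds to Pattern.DOTALL, and /g to calling find() repeatedly (only the first match is taken here).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PerlPatternPort {
    // The third Perl pattern, ported as written (element name matched by \w+).
    static final Pattern AS_WRITTEN = Pattern.compile(
            "<\\w+\\s+[^>]*?(?<=\\s)version\\s*=\\s*([\"'])(.+?)\\1[^>]*?\\s*/?>",
            Pattern.DOTALL);

    // Assumed tweak: allow ':', '.' and '-' in the element name so that
    // hyphenated names such as abc-config can match.
    static final Pattern WIDENED = Pattern.compile(
            "<[\\w:.-]+\\s+[^>]*?(?<=\\s)version\\s*=\\s*([\"'])(.+?)\\1[^>]*?\\s*/?>",
            Pattern.DOTALL);

    // Returns the attribute value (group 2) of the first match, or null.
    static String firstVersion(Pattern pattern, String xml) {
        Matcher m = pattern.matcher(xml);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<abc-config version=\"THIS\" id=\"abc\">\n"
                + "</abc-config>";
        System.out.println(firstVersion(AS_WRITTEN, xml)); // null
        System.out.println(firstVersion(WIDENED, xml));    // THIS
    }
}
```

Group 1 holds the opening quote character and \1 requires the same character to close the value, which is why the value itself is in group 2.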

  • I am not looking for something so complicated that I can't understand it, but many thanks. – minirasher Feb 07 '11 at 23:51
  • @minirasher - "something so complicated that cannot be understood" is pretty much regex's raison d'etre – Dónal Feb 08 '11 at 09:15