extracting attribute value in XML using regex

Question

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE ... ]> 
<abc-config version="THIS" id="abc">
...
</abc-config>

Hi all,

In the code above, how can I extract the value of the version attribute using a regex in Groovy/Java?

Thanks.

There's something you should know... I don't know how to say this, but... be prepared for 10,000 lectures. Oh, and welcome to Stack Overflow. – harpo, Feb 07 '11 at 23:04
If by 'regex' you mean 'XPath', then you've come to the right place. – Paul Ruane, Feb 07 '11 at 23:10
Thanks, but using regex is a requirement. I do not want to use XPath. – minirasher, Feb 07 '11 at 23:13
One way I can think of is to split the string at version=" and then again at " id=", but this seems sloppy and I am wondering if there is a better regex? – minirasher, Feb 07 '11 at 23:16
Unless this is a college assignment aimed at teaching regex, why is using a regex a requirement? Surely the real requirement is to get the data of interest out of the XML in an elegant fashion? Regex will not help you achieve this goal. – Dónal, Feb 08 '11 at 09:13

score 2 · Answer 1 · edited May 23 '17 at 11:56

A regex to handle this could be something like:

/<\?xml version="([0-9.]+)"/

I'll spare you one of the 10000 lectures about not using a regex to parse markup languages.

Edit: The One whose Name cannot be expressed in the Basic Multilingual Plane, He compelled me.
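
For reference, the pattern above captures the version in the XML declaration (1.0), not the one on abc-config. A minimal Java sketch aimed at the element from the question might look like the following. The class and method names are made up for illustration, and it assumes a double-quoted value and no '>' inside any attribute value, so treat it as a starting point rather than a robust parser:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class VersionAttributeRegex {

    // <abc-config, whitespace, optionally other attributes ending in
    // whitespace, then version="..." with the value captured in group 1.
    private static final Pattern VERSION = Pattern.compile(
            "<abc-config\\s+(?:[^>]*?\\s)?version\\s*=\\s*\"([^\"]*)\"");

    // Returns the attribute value, or an empty Optional if no abc-config
    // start tag with a double-quoted version attribute is found.
    static Optional<String> extractVersion(String xml) {
        Matcher m = VERSION.matcher(xml);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

The same pattern string should work unchanged from Groovy, since Groovy uses java.util.regex underneath.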

answered Feb 07 '11 at 23:40

CanSpice

I have read on many occasions that it is not good to use regex to parse HTML or XML, but I am compelled to do so since nothing (XmlParser, XmlSlurper, DOM, SAX) seems able to parse my XML file, which has a DOCTYPE declaration. Can you suggest a way around this? – minirasher Feb 07 '11 at 23:50
Post a question asking about that specific problem. Loads of properly-formed documents have DOCTYPE declarations and can be parsed. – CanSpice Feb 08 '11 at 00:10

tim_yates · Answer 2 · 2011-02-08T08:02:18.717

I know you asked for a regex, but what's wrong with this in Groovy?

Assuming the xml is something like:

def xml= '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE abc-config>
<abc-config version="THIS" id="abc">
  <node></node>
</abc-config>'''

Then I can parse it with:

def n = new XmlSlurper().parseText( xml )

And then this line:

println n.@version

Prints out "THIS"

If you are having problems with a more complex DOCTYPE failing to load, you can try disabling the DOCTYPE checker by either:

def parser = new XmlSlurper()
parser.setFeature( "http://apache.org/xml/features/nonvalidating/load-external-dtd", false )
parser.setFeature( "http://xml.org/sax/features/namespaces", false )
parser.parseText( xml )

or by using the two-argument XmlSlurper constructor, new XmlSlurper(validating, namespaceAware), and passing false for both so as to disable this checking.
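
If plain Java is needed instead of Groovy, a rough equivalent of the setFeature approach is sketched below. It assumes the JDK's built-in (Xerces-derived) parser, which recognises the same load-external-dtd feature; other parser implementations may reject it. The class and method names are illustrative only:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class DoctypeTolerantParse {

    // Parses the XML without fetching the external DTD named in the
    // DOCTYPE, then reads the version attribute of the root element.
    static String readVersion(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            // Xerces-specific feature: skip loading the external DTD.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd",
                    false);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("version");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Calling readVersion with the document from the question should return the same value as n.@version above, even when the DOCTYPE points at a DTD file that is not available.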

score 0 · Answer 3 · 2011-02-07T23:50:46.657

Not a Java regex but a Perl regex:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/sg

Note that this fails on many levels. I could fill the page with a proper regex, but I don't have the desire.

This fails too:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/sg

So does this:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*(["'])(.+?)\1[^>]*?\s*\/?>/sg
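
Since the question asks for Java/Groovy, here is a hedged port of the last pattern to java.util.regex. Two things differ from the Perl original and are my own assumptions: the element name uses [\w:.-]+ rather than \w+, because \w+ cannot match the hyphen in abc-config, and the /s and /g flags become Pattern.DOTALL and a find() loop. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class QuoteAwareAttributeRegex {

    // Group 1 is the opening quote (" or '), group 2 the attribute value;
    // the backreference \1 requires the same quote character to close it.
    private static final Pattern VERSION = Pattern.compile(
            "<[\\w:.-]+\\s+[^>]*?(?<=\\s)version\\s*=\\s*([\"'])(.+?)\\1[^>]*?\\s*/?>",
            Pattern.DOTALL);

    // Collects the version value of every start tag that carries one.
    static List<String> findVersions(String xml) {
        List<String> values = new ArrayList<>();
        Matcher m = VERSION.matcher(xml);
        while (m.find()) {
            values.add(m.group(2));
        }
        return values;
    }
}
```

Like the Perl versions, this still breaks on things such as a '>' inside an attribute value or tags inside comments and CDATA sections.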

edited Feb 07 '11 at 23:50

answered Feb 07 '11 at 23:39

I am not looking for something so complicated and something that I can't understand, but many thanks – minirasher Feb 07 '11 at 23:51
@minirasher - "something so complicated that cannot be understood" is pretty much regex's raison d'etre – Dónal Feb 08 '11 at 09:15
