XML Regex Search - Find specific blocks of code

Question

I'm having trouble finding a specific block of code in an XML file.

The XML is similar to this sample:

<object>
   <class>File</class>
   <name>Fall</name>
<desc>Description of Seasons: Fall</desc>
</object>

<object>
   <class>File</class>
   <name>Summer</name>
<desc>Description of Seasons: Summer</desc>
</object>

<object>
   <class>Image</class>
   <name>Summer1</name>
<desc>Image of Seasons: Summer</desc>
</object>

<object>
   <class>File</class>
   <name>Weather3</name>
<desc>Description of Weather</desc>
</object>

Basically, I want a regular expression that returns only the second object, the one named Summer.

How would I go about this?

I am stuck here:

<object>(.*?)<class>File</class>(.*?)Description of Seasons: Summer(.*?)</object>

But I am getting the first object in my search results as well.

I have the dot (.) set to match newlines, hence the syntax.

even if you get the 1st object too, why don't you just remove it from your results after you perform the regex — Sam I am says Reinstate Monica, Oct 23 '13 at 15:42
The dot doesn't match newlines. Using an xml parser or xpath will be easier. — Casimir et Hippolyte, Oct 23 '13 at 15:42
@SamIam - Good point...I didn't read the tags. Not fully awake yet :) — Tim, Oct 23 '13 at 15:43
You should read some of the other thousand or so posts about using regexes to parse XML, all of which contain at least one comment saying "Don't try to parse XML with a regex. Use an XML parser.". Start with any of them in the Related list to the right of your question text. — Ken White, Oct 23 '13 at 15:49
@SamIam this is a sample, I'm expecting in the actual XML file that it'll return upwards of 200+ results... seems too tedious to remove every other one it captures – user1683776, Oct 23 '13 at 15:50
@user1683776 It's not that tedious, at least not for the machine. — Sam I am says Reinstate Monica, Oct 23 '13 at 15:53
@KenWhite Do you know why people say you shouldn't parse XML with regex? — Sam I am says Reinstate Monica, Oct 23 '13 at 15:54
@SamIam http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg?rq=1 — clcto, Oct 23 '13 at 15:56
@SamIam: Certainly. Read the **very first question** in the Related list, which has 235 answers posted that explain exactly that thing. — Ken White, Oct 23 '13 at 15:58
@KenWhite IF all you can do is point me toward an external question, I'll assume that you merely know **of** problems with using Regex to parse xml, and that you don't actually know what the problems are — Sam I am says Reinstate Monica, Oct 23 '13 at 16:00
@KenWhite The reason why regex has problems with XML is because regex has no concept of hierarchy, and therefore has a lot of trouble with nested elements. If the XML in question **always** follows the pattern specified by the OP, this problem is not as relevant. – Sam I am says Reinstate Monica, Oct 23 '13 at 16:02
@SamIam: I'm aware. I also don't see any reason to discuss it **here** yet again, when it's been discussed many, many times before. Thus the reason I keep referring it to the other thousands of places it's been done. No point in repeating the clutter here. — Ken White, Oct 23 '13 at 16:13
@KenWhite Ahem... If the XML in question always follows the pattern specified by the OP, the problems with using regex on xml are not as relevant — Sam I am says Reinstate Monica, Oct 23 '13 at 16:18
@SamIam: **Always** is a long time, and never seems to last as long as we think it will. Why the heck not do it right in the first place, instead of hacking something that will then have to change again next week or next month? And, once again, **not going to discuss it here**. Read the other thousand posts that have detailed discussions of the poor decision to use regexes to parse XML. I **am not** repeating them here yet again. — Ken White, Oct 23 '13 at 16:26
If the XML always followed this pattern (it won't, I can guarantee that), then it's a massive design error to use XML in the first place. You should use a flat record format like properties and not a faux-XML. — biziclop, Oct 24 '13 at 11:14

score 3 · Answer 1 · edited May 23 '17 at 10:25


You really will be better off not using a regular expression. See here for a good reason why regular expressions should not be used to parse XML.

A far simpler approach is to use XPath, e.g.:

//object[name="Summer"]

If you applied this XPath expression to your XML (assuming you enclosed your malformed XML within a root tag) then it would only select the "2nd object named Summer".

XML libraries that support XPath exist for most, if not all, programming languages (C/C++, Java, .NET, JavaScript, etc.).
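As a rough illustration of the XPath approach above, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath. The `<root>` wrapper and the helper name `find_objects_by_name` are assumptions made for this example; they are not part of the original question or answer.

```python
import xml.etree.ElementTree as ET

# Sample data from the question. It has no single root element,
# so it is wrapped in a hypothetical <root> tag before parsing.
FRAGMENT = """
<object>
   <class>File</class>
   <name>Fall</name>
<desc>Description of Seasons: Fall</desc>
</object>
<object>
   <class>File</class>
   <name>Summer</name>
<desc>Description of Seasons: Summer</desc>
</object>
<object>
   <class>Image</class>
   <name>Summer1</name>
<desc>Image of Seasons: Summer</desc>
</object>
<object>
   <class>File</class>
   <name>Weather3</name>
<desc>Description of Weather</desc>
</object>
"""


def find_objects_by_name(fragment, name):
    """Return every <object> whose <name> child has exactly the given text."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    # ElementTree's XPath subset includes the [child='text'] predicate.
    return root.findall(f".//object[name='{name}']")


matches = find_objects_by_name(FRAGMENT, "Summer")
for obj in matches:
    print(obj.findtext("class"), "|", obj.findtext("desc"))
```

Because the predicate compares the full text of `<name>`, the object named Summer1 is not selected. A full XPath engine would accept the `//object[name="Summer"]` form directly.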

edited May 23 '17 at 10:25 by Community · answered Oct 23 '13 at 23:50 by Ben Smith

The perfect answer imo. XPath is the W3C recommended technology for running queries over XML. Give Fresh a nice big green tick! – Gusdor Oct 24 '13 at 09:50

score 0 · Answer 2 · answered Oct 24 '13 at 09:38


A regex cannot be guaranteed to work for every scenario. There will be scenarios where it will fail. A parser is guaranteed to work for every scenario, regardless. XPath is what you want. This is a daily topic on SO, so I'll skip the sermon and try and solve the problem.

I'm using PCRE syntax:

~<object>.*?</object>.*?(<object>.*?</object>)~s

You'll need the s modifier so the . matches newlines. Your second object will be captured in group #1.

This is untested but should work.
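For readers who want to try the pattern, here is a small sketch of the same expression in Python's `re` module, where the `re.DOTALL` flag plays the role of the PCRE `s` modifier. The shortened `TEXT` sample and the variable names are assumptions for this example. Note that the pattern picks the second object purely by position, not by its content.

```python
import re

# Shortened version of the question's sample (first three objects only).
TEXT = """<object>
   <class>File</class>
   <name>Fall</name>
<desc>Description of Seasons: Fall</desc>
</object>

<object>
   <class>File</class>
   <name>Summer</name>
<desc>Description of Seasons: Summer</desc>
</object>

<object>
   <class>Image</class>
   <name>Summer1</name>
<desc>Image of Seasons: Summer</desc>
</object>
"""

# Same pattern as in the answer, without the ~ delimiters.
# re.DOTALL makes "." match newlines as well.
SECOND_OBJECT = re.compile(
    r"<object>.*?</object>.*?(<object>.*?</object>)", re.DOTALL
)

match = SECOND_OBJECT.search(TEXT)
second = match.group(1) if match else None
print(second)
```

If the objects appear in a different order, or the wanted object is not the second one, this positional pattern returns the wrong block, which is one reason the comments recommend a parser.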

answered Oct 24 '13 at 09:38 by gwillie

A bad regex could fail but a bad parser could also fail. Perfect implementations of both will succeed. The difference is that a parser implementation will achieve perfection with less code, less stress and less margin for error. – Gusdor Oct 24 '13 at 09:51
No, regexes will ALWAYS fail, no matter how good your implementation. You show me any regex that supposedly parses a piece of XML correctly, and I'll show you a well-formed piece of XML that breaks it. – biziclop Oct 24 '13 at 11:01

biziclop · Answer 3 · 2013-10-24T11:07:13.173

Regular expressions, as their name implies, can only recognise regular languages. Regular languages obey the pumping lemma for regular languages, which states (roughly) that every valid word of a regular language beyond a certain length contains a portion of text that can be repeated any number of times to produce further valid words.

XML, however, isn't a regular language; it's a context-free (CF) language. (You can prove that it isn't regular by applying the pumping lemma.)
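As a sketch of how that argument can go, here is the standard statement of the pumping lemma, followed by one candidate word. The element name `a` and the choice of word are assumptions made for illustration, not something taken from the answer.

```latex
% Pumping lemma for regular languages
\text{If } L \text{ is regular, then } \exists\, p \ge 1 \text{ such that } \forall\, w \in L \text{ with } |w| \ge p:
\quad w = xyz, \qquad |y| \ge 1, \qquad |xy| \le p, \qquad \forall\, i \ge 0:\ x\,y^{i}\,z \in L.

% Candidate word: p nested opening tags followed by p closing tags
w = (\texttt{<a>})^{p}\,(\texttt{</a>})^{p}
```

Since |xy| ≤ p, the pumped part y lies entirely within the run of opening tags. Repeating y therefore either corrupts a tag or alters the opening tags while leaving the closing tags unchanged, so the pumped word is no longer well-formed. That contradicts the lemma, so the language of well-formed documents cannot be regular.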

Context-free languages can only be described by context-free grammars and parsed by context-free parsers (LL(k)/LR(k), CYK or Earley parsers), all of which produce a parse tree, which regular expressions cannot.
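To make the point about parse trees concrete, here is a small Python sketch contrasting a regex with a parser on nested elements. The `NESTED` input is a made-up example (the question's sample has no nesting), so treat it as an assumption about what real data might contain.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: an <object> nested inside another <object>.
NESTED = (
    "<object><name>outer</name>"
    "<object><name>inner</name></object>"
    "<desc>tail</desc></object>"
)

# The lazy pattern stops at the first </object>, which belongs to the
# inner element, so the match cuts the outer element in half.
regex_hit = re.search(r"<object>.*?</object>", NESTED, re.DOTALL).group(0)

try:
    ET.fromstring(regex_hit)
    regex_hit_is_well_formed = True
except ET.ParseError:
    regex_hit_is_well_formed = False

# A parser builds the tree, so both levels are recovered intact.
root = ET.fromstring(NESTED)
names = [obj.findtext("name") for obj in root.iter("object")]
outer_desc = root.findtext("desc")

print(regex_hit)
print(regex_hit_is_well_formed, names, outer_desc)
```

Switching to a greedy `.*` would fix this particular input but would then swallow sibling objects, which is the kind of trade-off a tree-building parser avoids.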
