
This is a purely academic exercise about regex and my understanding of grouping multiple patterns. I have the following example string:

<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>

Using C# Regex I'm attempting to extract the groups of 3 capital letters.

At the moment, if I use the pattern >.+?</ I get

Found 5 matches:
>JKL</
>MNO</
>PQR</
>VWX</
>BCD</

If I then use id=".+?"> I get

Found 5 matches:
id="ABC">
id="DEF">
id="GHI">
id="STU">
id="YZA">

Now I'm trying to combine them with a logical OR (|) between the terms on each side: id="|>.+?">|</

However, this isn't giving me the combined results of both patterns.

My questions are:

  1. Can someone explain why this isn't working as expected?

  2. How can I correct the pattern to get both result sets combined, in the order listed?

  3. How can I further enhance the combined pattern to just give letters only? I'm hoping it's still ?<= and ?=< but just want to check.

Thank you

  • I don't quite understand what you want, but if you just want groups of three capital letters `\b([A-Z]{3})\b` – CaffGeek Oct 02 '12 at 20:09

4 Answers


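Your regex doesn't know where to start or stop the alternative options separated by |. Alternation has the lowest precedence of all regex operators, so id="|>.+?">|</ is read as three separate alternatives: id=" or >.+?"> or </. To limit the scope of each |, you need to put the alternatives in subpatterns: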

(id="|>).+?(">|</)

However, regex is not the right tool to parse XML.
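The precedence issue can be seen in a short C# sketch. It assumes the standard System.Text.RegularExpressions API; the AlternationDemo class, the FindAll helper, and the one-line sample input are made up for illustration and are not from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Returns the text of every match of the pattern, in order of appearance.
    public static string[] FindAll(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Hypothetical sample, shorter than the XML in the question.
        string input = @"<a id=""ABC""><b>DEF</b>";

        // Ungrouped: | splits the whole pattern into id="  or  >.+?">  or  </
        Console.WriteLine(string.Join(" ", FindAll(@"id=""|>.+?"">|</", input)));
        // prints: id=" </

        // Grouped: each | only chooses between the two delimiters around it.
        Console.WriteLine(string.Join(" ", FindAll(@"(id=""|>).+?("">|</)", input)));
        // prints: id="ABC"> >DEF</
    }
}
```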

Those round brackets also create capturing subpatterns, whose matches can be retrieved separately. So this:

(id="|>)(.+?)(">|</)

will return the whole match at index 0, the front-delimiter at index 1, the actual match you want at index 2, and the last delimiter at index 3. In most regex engines you can do this:

(?:id="|>)(.+?)(?:">|</)

to avoid capturing the delimiters. Now index 0 will have the whole match, and index 1 only the 3 letters. Unfortunately, I can't tell you how to retrieve them in C#.
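For reference, .NET exposes the indexes described above through Match.Groups. The following is a minimal sketch, assuming the standard System.Text.RegularExpressions API; the GroupDemo class, the ExtractCodes method, and the shortened sample input are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Collects capturing group 1 (the text between the delimiters) from every match.
    public static List<string> ExtractCodes(string input)
    {
        var codes = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?:id=""|>)(.+?)(?:"">|</)"))
        {
            // Groups[0] is the whole match; Groups[1] is the first capturing group.
            codes.Add(m.Groups[1].Value);
        }
        return codes;
    }

    public static void Main()
    {
        // Hypothetical sample: a few lines taken from the XML in the question.
        string input = "<xContext id=\"ABC\">\n<xData id=\"DEF\">\n<ID>JKL</ID>\n<str>MNO</str>";
        Console.WriteLine(string.Join(",", ExtractCodes(input)));
        // prints: ABC,DEF,JKL,MNO
    }
}
```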

Martin Ender
  • Thanks - this also gives the correct answer! I usually test the c# ones here http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx –  Oct 02 '12 at 20:17

You need to group the alternatives together

(?:id="|>).+?(?:">|</)

And to get the letters only, use positive lookbehind and lookahead assertions

(?<=id="|>).+?(?=">|</)

See it here on Regexr

The groups starting with ?<= and ?= are zero-width assertions. That means they check that their pattern matches at the current position but consume no characters, so what they match is not part of the result. They just "look" behind or ahead.
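A minimal C# sketch of the lookaround version, assuming the standard System.Text.RegularExpressions API (the LookaroundDemo class, the ExtractCodes method, and the shortened sample input are hypothetical). Because the delimiters are only asserted, Match.Value should already hold the bare letters and no group index is needed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // The delimiters are asserted by the lookbehind and lookahead but never
    // consumed, so each Match.Value is just the text between them.
    public static string[] ExtractCodes(string input) =>
        Regex.Matches(input, @"(?<=id=""|>).+?(?="">|</)")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // Hypothetical sample: one tag per line, as in the question.
        string input = "<xContext id=\"ABC\">\n<ID>JKL</ID>\n<order id=\"STU\">\n<str>VWX</str>";
        Console.WriteLine(string.Join(",", ExtractCodes(input)));
    }
}
```

This relies on each tag sitting on its own line, as in the question. Since the closing > of a tag is not consumed, input with several tags on one line can produce extra matches that begin right after that >.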

stema
  • Are you sure variable length lookbehind is allowed in c# regex? – Ωmega Oct 02 '12 at 20:16
  • Also a lookbehind is less efficient I think, because the string has to be traversed twice. So I think it's a bit overkill in this situation. Of course, it's a more general and thus rather elegant solution ;). – Martin Ender Oct 02 '12 at 20:18
  • @m.buettner - It may be okay about performance, as it saves some ticks on not creating backreference (as your code does). – Ωmega Oct 02 '12 at 20:21
  • Haha, fair enough. Someone should probably profile it before making such claims ^^. I just wanted to bring it to mind that lookbehinds can generally incur some overhead. – Martin Ender Oct 02 '12 at 20:23
  • @Ωmega, yes c# is the only language I know without restrictions on the lookbehinds. – stema Oct 02 '12 at 20:25
  • @stema - Thanks for that information. I was not sure, as I code regex in Perl. Should I switch to C#..? :)) I wish Perl makes this possible as well... – Ωmega Oct 02 '12 at 20:26

I would suggest using the regex pattern (?:(?<=id=")|(?<=>)).+?(?=">|</)

Test it here on RegExr.

Ωmega

Capturing groups FTW!

@">(?<content>.+?)<|id=""(?<content>.+?)"""

Specifically, named capturing groups, because the .NET regex flavor lets you use the same group name as many times as you want in the same regex. Reading Groups["content"] on the Match object will return the content without regard to its location (i.e., between two tags or in an id attribute).
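A minimal sketch of how that might look in full, assuming the standard System.Text.RegularExpressions API; the NamedGroupDemo class, the ExtractContent method, and the shortened sample input are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Both alternatives capture into the same group name, "content".
    private static readonly Regex Pattern =
        new Regex(@">(?<content>.+?)<|id=""(?<content>.+?)""");

    // Reads the "content" group from every match, whichever alternative matched.
    public static List<string> ExtractContent(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(m.Groups["content"].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample: one tag per line, as in the question.
        string input = "<xData id=\"GHI\">\n<ID>JKL</ID>\n<order id=\"YZA\">\n<str>BCD</str>";
        Console.WriteLine(string.Join(",", ExtractContent(input)));
    }
}
```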

Alan Moore