Against a string like this:

<h3>title</h3>
<h4>title</h4>

How can I match each opening tag with its corresponding closing tag and get the text between them?

This works, but it also captures the tag name, which I don't need:

'@<(h[34])>(.+)</\1>@sU'

I don't want to capture the tag name, only backreference it, but this version doesn't work:

'@<(?:h[34])>(.+)</\1>@sU'

I'm using PHP's preg_match(). Why doesn't the second approach work? Is it possible to backreference a non-capturing group?

datasn.io
  • No, you can't backreference something that isn't there. How would `/.+\7/` work? In your second example, `\1` would match the content of `(.+)`. – mario Aug 10 '14 at 03:27
  • @mario, so how does one match an HTML tag and the content in it using regex? Any common practice here? – datasn.io Aug 10 '14 at 03:36
  • You use a DOM parser and not regex – hjpotter92 Aug 10 '14 at 03:42
  • @hjpotter92, even for malformed DOM documents? I thought regex would be more universal, so that I wouldn't have to worry about a broken DOM. Plus, I may also need to parse ordinary strings with similar patterns, not just XML / HTML documents. – datasn.io Aug 10 '14 at 03:45
  • @kavoir.com https://stackoverflow.com/q/6031546/1190388 – hjpotter92 Aug 10 '14 at 03:49
  • [**Don't parse X/HTML with regex. Don't parse X/HTML with regex. Don't parse X/HTML with regex.**](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) Did I mention not to parse X/HTML with regex? – Qix - MONICA WAS MISTREATED Aug 10 '14 at 05:03

1 Answer

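A capturing group can be referenced later in the regular expression with a backreference, which matches the same text that the group matched. Placing ?: at the start of a group makes it non-capturing: it only groups the expression and stores nothing, so there is nothing for \1 to refer to. In your second pattern, (.+) therefore becomes group 1, and \1 points at the tag content instead of the tag name.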
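To make the difference concrete, here is a minimal sketch using the two patterns from the question. The sample string and the variable names ($html, $texts, $count) are assumptions made up for illustration.

```php
<?php
// Hypothetical sample input, modelled on the string in the question.
$html = "<h3>first</h3>\n<h4>second</h4>";

// Capturing version: group 1 is the tag name, so \1 requires the same
// name in the closing tag. Group 2 holds the text. The U flag makes .+
// lazy and the s flag lets . match newlines.
preg_match_all('@<(h[34])>(.+)</\1>@sU', $html, $m);
$texts = $m[2];   // tag contents only; $m[1] (the tag names) can be ignored
print_r($texts);

// Non-capturing version: (?:h[34]) stores nothing, so (.+) becomes
// group 1 and \1 now means "the tag content repeated", which this
// input never contains.
$count = preg_match('@<(?:h[34])>(.+)</\1>@sU', $html);
var_dump($count);
```

If the extra tag-name capture is only a cosmetic concern, reading index 2 of the match array, as above, is the simplest way to ignore it.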

You can use the branch reset feature (?| ... | ... ) instead. Each alternative spells out a complete opening and closing tag pair, so the expression does not match non-corresponding tags, and the capturing groups in the alternatives share the same number, so they are treated as one capturing group.

~(?|<h3>(.+?)</h3>|<h4>(.+?)</h4>)~s
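As a rough sketch of how the branch reset pattern above could be used with preg_match_all(): the sample string and the variable names ($html, $matches, $texts) are assumptions for illustration only.

```php
<?php
// Hypothetical sample input; the last line has a mismatched closing tag.
$html = "<h3>first</h3>\n<h4>second</h4>\n<h3>mismatched</h4>";

// Inside (?| ... ), group numbering restarts in every alternative,
// so both (.+?) groups are group 1 and no tag name is captured.
preg_match_all('~(?|<h3>(.+?)</h3>|<h4>(.+?)</h4>)~s', $html, $matches);

$texts = $matches[1];   // text of every matched heading, whichever branch matched
print_r($texts);
```

The trade-off is that every tag name has to be written out as its own alternative, which is fine for two tags but grows with each tag you add.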

hwnd