
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I have a string containing HTML like this:

<html>
  <div>
      <p>this is sample content</p>
  </div>
  <div>
      <p>this is another sample</p>
      <span class="test">this sample should not caught</span>
      <div>
       this is another sample
      </div>
  </div>
</html>

Now I want to search for the word "sample" in this string, but I should not get the "sample" that is inside the <span>...</span>.

I want to do this using a regex. I have tried a lot but can't get it to work; any help is greatly appreciated.

Thanks in advance.

CalvT
Hulk
  • 3
    Unless this piece of html is always the same, it's a bad idea to parse html/xml with a regular expression. [Here's why](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg). – acme Sep 21 '12 at 09:26
  • @acme I agree with you, but how do I solve my problem? Is there any way to do this? – Hulk Sep 21 '12 at 09:31
  • 1
    Please refrain from parsing HTML with RegEx as it will [drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an [HTML parser](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) instead. – Madara's Ghost Sep 22 '12 at 09:01

1 Answer


This is quite brittle and fails if there can be nested span tags. If you don't have those, try

(?s)sample(?!(?:(?!</?span).)*</span>)

This matches sample only if the next span tag that follows it (if any) is not a closing tag.

Explanation:

(?s)          # Switch on dot-matches-all mode
sample        # Match "sample".
(?!           # only if it's not followed by the following regex:
 (?:          #  Match...
  (?!</?span) #   (unless we're at the start of a span tag)
  .           #   any character
 )*           #  any number of times.
 </span>      #  Match a closing span tag.
)             # End of lookahead
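As a quick check, the pattern can be run against the HTML from the question. This is only a sketch in Python (the thread does not say which language or regex engine the asker uses); the names `HTML`, `PATTERN`, `all_hits` and `matches` are made up for the illustration.

```python
import re

# The sample document from the question.
HTML = """<html>
  <div>
      <p>this is sample content</p>
  </div>
  <div>
      <p>this is another sample</p>
      <span class="test">this sample should not caught</span>
      <div>
       this is another sample
      </div>
  </div>
</html>"""

# The pattern from the answer; (?s) lets "." cross line breaks.
PATTERN = re.compile(r"(?s)sample(?!(?:(?!</?span).)*</span>)")

# Offsets of every "sample", and of those the pattern accepts.
all_hits = [m.start() for m in re.finditer("sample", HTML)]
matches = [m.start() for m in PATTERN.finditer(HTML)]

print(len(all_hits), "occurrences in total,", len(matches), "outside <span>")
```

The occurrence inside the span is the only one followed by `</span>` before any other span tag, so it is the only one the lookahead rejects.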

To match sample only if it's neither within a span nor a p, you can use

(?s)sample(?!(?:(?!</?span).)*</span>)(?!(?:(?!</?p).)*</p>)
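Since each excluded tag just adds one more lookahead of the same shape, the pattern can be generated from a list of tag names. The helper below is a hypothetical sketch in Python (`build_pattern` is not part of any library, and Python is an assumption about the environment).

```python
import re


def build_pattern(word, tags):
    """Hypothetical helper: append one negative lookahead per tag name."""
    lookaheads = "".join(
        rf"(?!(?:(?!</?{re.escape(t)}).)*</{re.escape(t)}>)" for t in tags
    )
    # (?s) switches on dot-matches-all mode, as in the answer.
    return re.compile("(?s)" + re.escape(word) + lookaheads)


pattern = build_pattern("sample", ["span", "p"])
text = "<p>sample one</p><span>sample two</span><div>sample three</div>"

# Only the occurrence inside the <div> survives both lookaheads.
offsets = [m.start() for m in pattern.finditer(text)]
print(pattern.pattern)
print(offsets == [text.index("sample three")])
```

One caveat of the generated pattern: a prefix such as `</?p` also matches the start of longer tag names like `<pre>`, so tag names that are prefixes of others may need extra care.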

But all this depends entirely on tags being unnested (i.e., no two tags of the same kind may be nested) and correctly balanced (which often isn't the case with p tags).
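A small example of the nesting problem, again as a Python sketch with made-up input: the word sits inside the outer span, but the next span tag after it is an opening one, so the lookahead lets it through.

```python
import re

PATTERN = re.compile(r"(?s)sample(?!(?:(?!</?span).)*</span>)")

# "sample" is inside the outer <span>, yet the next span tag is "<span>".
nested = "<span>outer sample <span>inner</span> tail</span>"
false_hits = PATTERN.findall(nested)

# An unclosed <p> has the same effect on the p-variant of the pattern.
P_PATTERN = re.compile(r"(?s)sample(?!(?:(?!</?p).)*</p>)")
unbalanced = "<p>sample text<div>more</div>"
unbalanced_hits = P_PATTERN.findall(unbalanced)

print(false_hits, unbalanced_hits)
```

Both lists contain a match even though the word is inside the tag that was supposed to be skipped.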

Tim Pietzcker
  • Hi, it's working fine. Can I skip more than one tag using this regex? I.e., if I want to skip content from both the <span> and <p> tags, how do I do it? – Hulk Sep 21 '12 at 09:49
  • Just add another lookahead. See my edit. – Tim Pietzcker Sep 21 '12 at 09:52
  • Make the / of the closing tag optional in the lookahead; I think this will solve the nested tags problem... sample(?!(?:(?!</?span).)*</?span>)(?!(?:(?!</?p).)*</?p>) – Hulk Sep 21 '12 at 10:06
  • No, it won't. Then it won't match at all if a `span` or `p` tag follows in the string. If it does, it's because you need to switch on dot-matches-all mode for this regex to work with multiline strings. I had forgotten to make this explicit. I've edited my answer accordingly. – Tim Pietzcker Sep 21 '12 at 10:08
  • No, it's matching correctly. Just check it once; I tested it using "Expresso". – Hulk Sep 21 '12 at 10:11
  • As I said, you need to turn on dot-matches-all mode (in expresso and otherwise). Otherwise this regex fails on tags that span multiple lines. Take your pick :) – Tim Pietzcker Sep 21 '12 at 10:12
  • Hi, it's working perfectly fine for me. The only problem I'm facing is that with a large string it takes much longer to capture the matches. Can you suggest anything for this problem? – Hulk Sep 24 '12 at 12:56
  • Yes, this is to be expected - the nested lookaheads are rather inefficient. That's probably unavoidable - which is another reason why HTML should be parsed instead of regexed. – Tim Pietzcker Sep 24 '12 at 13:00
  • Which parser can I use for this situation? – Hulk Sep 24 '12 at 13:23
  • Perhaps you can find a workable solution here: http://stackoverflow.com/search?q=parse+html+objective-c – Tim Pietzcker Sep 24 '12 at 14:16
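
To illustrate the parser-based approach the comments recommend, here is a minimal sketch using Python's standard-library `html.parser`. The class name `WordOutsideTags` and its attributes are invented for this example, and the asker's actual platform may need a different parser. A depth counter tracks how many skipped elements are currently open, so nested spans are handled and the input is scanned once.

```python
from html.parser import HTMLParser


class WordOutsideTags(HTMLParser):
    """Hypothetical helper: count a word in text outside the given tags."""

    def __init__(self, word, skip_tags):
        super().__init__()
        self.word = word
        self.skip_tags = set(skip_tags)
        self.depth = 0   # number of skipped elements currently open
        self.count = 0   # occurrences found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only count text that is not inside any skipped element.
        if self.depth == 0:
            self.count += data.count(self.word)


parser = WordOutsideTags("sample", ["span"])
parser.feed("<div><p>a sample</p><span>no sample <span>x</span> sample</span> sample</div>")
parser.close()
print(parser.count)
```

Here the two occurrences inside the nested spans are ignored and the two outside are counted.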