I want to write a regex that ignores tags appearing in the middle of a word.

For example, here is my string:

<p>hi this is a reg<del>U</del><ins>u</ins>lar expression match</p>

I want a regular expression that finds "regular" in the string above. The match should cover the whole word including the tags, i.e. reg<del>U</del><ins>u</ins>lar

Case can be ignored.

Please help. Thanks in advance.

Alan Moore
Hulk
  • Gah! [Don't parse HTML with regex](http://stackoverflow.com/a/1732454/383609)! If you insist on using regex, what language are you doing this in? With jQuery, it's trivial to get just the text, for example. – Bojangles Aug 21 '12 at 08:48
  • Ignoring the tags would mean that your string becomes `regUular` (because those tags have meaning, you know). So what exactly do you want to ignore? @JamWaffles: This also would need to be taken into account with a jQuery solution, making it nontrivial. – Tim Pietzcker Aug 21 '12 at 08:50
  • @Tim Very true, I didn't see that extra `U` in the HTML. – Bojangles Aug 21 '12 at 08:52
  • @JamWaffles I'm using C# in ASP.NET. – Hulk Aug 21 '12 at 09:06
  • @TimPietzcker I want to ignore the content inside the tag, so my match should be just 'regular', not 'regUular'. – Hulk Aug 21 '12 at 09:08
  • So is it *just* the `<del>` tag you want to ignore with its contents? And all other tags should be treated like they weren't even there? – Tim Pietzcker Aug 21 '12 at 09:14
  • @TimPietzcker Yes, what you said is exactly my requirement. – Hulk Aug 21 '12 at 09:21

2 Answers

I don't think you can get a robust solution in regex. At any rate, it won't be very readable. Here, in verbose form, is a regex that conforms to your revised specifications. It needs to be compiled with the verbose (ignore pattern whitespace) and case-insensitive options. Note that it fails to handle `<del>` elements that contain nested tags; that's beyond what plain regular expressions can handle.

\b        # Start of word
r         # Match r
(?:       # Match either
 <del>    #  <del>
 [^<>]*   #  any characters besides angle brackets
 </del>   #  </del>
|         # or
 <[^<>]*> #  any other tag
)*        # End of alternation
e         # Match e
(?:<del>[^<>]*</del>|<[^<>]*>)*  # etc...
g
(?:<del>[^<>]*</del>|<[^<>]*>)*
u
(?:<del>[^<>]*</del>|<[^<>]*>)*
l
(?:<del>[^<>]*</del>|<[^<>]*>)*
a
(?:<del>[^<>]*</del>|<[^<>]*>)*
r
\b
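
Since every letter is followed by the same "skip tags" group, the pattern can be generated for any word instead of typed by hand. Here is a minimal sketch in Python rather than C#; the helper name `tag_tolerant_pattern` is made up for illustration, and it carries the same limitation regarding nested tags inside `<del>`. In .NET the analogous pieces would be `Regex.Escape` and `RegexOptions.IgnoreCase`.

```python
import re

# Zero or more of: a <del>...</del> element without nested tags, or any other single tag.
TAG_GAP = r"(?:<del>[^<>]*</del>|<[^<>]*>)*"

def tag_tolerant_pattern(word):
    """Build a case-insensitive regex for `word` that tolerates tags between its letters.

    Hypothetical helper: it just joins the escaped letters with TAG_GAP.
    """
    letters = [re.escape(ch) for ch in word]
    return re.compile(r"\b" + TAG_GAP.join(letters) + r"\b", re.IGNORECASE)

html = "<p>hi this is a reg<del>U</del><ins>u</ins>lar expression match</p>"
match = tag_tolerant_pattern("regular").search(html)
print(match.group(0) if match else None)
# prints: reg<del>U</del><ins>u</ins>lar
```

The match includes the embedded tags, so its start and end positions refer to the original markup.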
Tim Pietzcker
  • This is fine when I'm dealing with a long string like the example I gave, but it won't help in the general case. – Hulk Aug 21 '12 at 09:10
  • One more thing: in my case there will only be tags between the characters, but the regex above also matches characters other than tags. – Hulk Aug 21 '12 at 09:14
  • @harish: I've updated the regex. It works on your example and follows your new specs closely. – Tim Pietzcker Aug 21 '12 at 10:00
  • Thanks, it's working fine. I will check it with different combinations and let you know if any problems come up. – Hulk Aug 21 '12 at 10:18
  • Hi, can you check this question? http://stackoverflow.com/questions/12474742/skip-xml-content-while-doing-regex-search-and-replace – Hulk Sep 18 '12 at 11:49
  • I am struggling to find the right method (or answer) for that post. Please help me. – Hulk Sep 18 '12 at 11:50

You really need some form of HTML parser here. Regexps are unsuited for HTML and you'll spend your time refining and tweaking to try and cover all the edge cases (which you just can't).
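
As a rough illustration of the parser route, here is a minimal sketch using Python's standard `html.parser` module (the asker is on C#, where an HTML parsing library would play the same role). The class name `VisibleText` and the function `visible_text` are hypothetical. It collects the text while skipping everything inside `<del>` elements, nested tags included, and then a plain word search is enough.

```python
import re
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text content, skipping anything inside <del> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.del_depth = 0  # greater than 0 while inside one or more <del> elements

    def handle_starttag(self, tag, attrs):
        if tag == "del":
            self.del_depth += 1

    def handle_endtag(self, tag):
        if tag == "del" and self.del_depth > 0:
            self.del_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a <del> element.
        if self.del_depth == 0:
            self.parts.append(data)

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

html = "<p>hi this is a reg<del>U</del><ins>u</ins>lar expression match</p>"
text = visible_text(html)
print(text)  # prints: hi this is a regular expression match
print(bool(re.search(r"\bregular\b", text, re.IGNORECASE)))  # prints: True
```

This tells you whether the word occurs in the visible text, but not where it sits in the original markup; mapping a hit back to the source would need extra bookkeeping of offsets.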

Brian Agnew