1

I need a C# regex pattern that removes everything between < and >.

E.g., if my string is

<Html> some stuff here 123445!@#$% </HTML>

then the regex should return only

some stuff here 123445!@#$%

It should remove everything between < and >, and also remove the "<" and ">" characters themselves.

Ocaso Protal
NoobDeveloper
  • Are you talking specifically about just the `<` and `>` characters and the characters in between? I.e. 123<456>789 would return 123789? – ediblecode Jan 05 '12 at 09:01
  • Have you tried anything at all? – Samuel Harmer Jan 05 '12 at 09:06
  • Look into escaping HTML characters; it is a common measure against XSS. – Acn Jan 05 '12 at 09:07
  • @user1016253 Yes. I want anything between < and > to be removed, along with the < and > themselves. So your example is perfect. So if I have ! @ # $ % ^ & * ( ) more stuff then the output should be ! @ # $ % ^ & * ( ) more stuff – NoobDeveloper Jan 05 '12 at 09:08
  • @Styne666 I have tried http://txt2re.com, but I honestly admit I am not good at this stuff. I know that a regex will be better than looping over the string, hence posting here instead of writing crap looping code. :) – NoobDeveloper Jan 05 '12 at 09:12
  • It looks like you're trying to [parse xml/html with regex](http://stackoverflow.com/a/1732454/3603). – Richard Szalay Jan 05 '12 at 09:14

3 Answers

4
Here is a working example:

string plainText = Regex.Replace(htmlText, "<[^>]+?>", "");

http://regexr.com?2vl05

edit

Reading it the way the regex engine does:

< = search for a '<' character

[^>] = then match a character that is not '>'

+ = keep matching more such characters (one or more in total)

? = but don't be greedy, i.e. match as few as possible

> = then match the closing '>'; because the match is not greedy, it stops at the first '>' it encounters

and replace the whole match with "" (an empty string).
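The walkthrough above can be tried out with a small sketch. The class name `TagStripper` and the method name `Strip` are made up for illustration; only `Regex.Replace` and the pattern come from the answer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class TagStripper
{
    // Removes every "<...>" run, including the angle brackets themselves.
    public static string Strip(string input)
    {
        // "<" then one or more non-'>' characters (lazily), then ">".
        return Regex.Replace(input, "<[^>]+?>", "");
    }
}
```

Calling `TagStripper.Strip("123<456>789")` should give `123789`, and the string from the question comes back as ` some stuff here 123445!@#$% ` with its surrounding spaces intact.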

Royi Namir
  • This one is indeed better than my example; I forgot the exclusion class. The non-greedy flag is also good – Michiel van Vaardegem Jan 05 '12 at 09:18
  • Thanks. Can you explain how I should read "<[^>]+?>"? Any good links for learning the basics of regex? I have always ignored this topic :) – NoobDeveloper Jan 05 '12 at 09:19
  • Why the lazy operator?? You try and match whatever is not a `>` until `>`! `[^>]` will _not_ match `>` anyway. So, quite the opposite, it could be a possessive operator if supported! – fge Jan 05 '12 at 09:28
  • Ah, this is C#, so possessive quantifiers are not supported. Thus an atomic group could be used: `<(?>[^>]+)>` -- but a plain `<[^>]+>` is enough if you don't want to bother – fge Jan 05 '12 at 09:31
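
The three variants mentioned in this thread (lazy, plain and atomic group) can be compared side by side with a sketch like the following. The class and member names are hypothetical; the patterns are the ones quoted in the answer and comments.

```csharp
using System.Text.RegularExpressions;

// Hypothetical container for the three patterns discussed in the thread.
public static class TagPatternVariants
{
    public const string Lazy = "<[^>]+?>";      // pattern from the answer
    public const string Plain = "<[^>]+>";      // same pattern without the lazy modifier
    public const string Atomic = "<(?>[^>]+)>"; // atomic group, so no backtracking into the class

    // Replaces every match of the given pattern with an empty string.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }
}
```

Since `[^>]` can never consume a `>`, all three patterns should remove exactly the same text; for `123<456>789` each returns `123789`. The difference is only in how much backtracking the engine may attempt.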
1

Something like: \<.+?\>(.*)\<\/.+?\> Group one will be the string between the two tags. Note that the quantifier has to be lazy (.+?); a greedy .+ would run on to the last > in the string.

You could also do a regex replace on \<\/?.+?\>, replacing each match with nothing.
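
With a dot-based pattern, whether the quantifier is greedy or lazy changes the result considerably. The sketch below contrasts the two; the class and method names are invented for illustration, and the input is assumed to be a single line (by default `.` does not match a newline).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper contrasting greedy and lazy dot-based tag patterns.
public static class GreedyVersusLazy
{
    // Greedy: ".+" runs to the LAST '>' on the line, so one match can swallow
    // everything between the first '<' and the final '>'.
    public static string StripGreedy(string input)
    {
        return Regex.Replace(input, "</?.+>", "");
    }

    // Lazy: ".+?" stops at the FIRST '>' after each '<'.
    public static string StripLazy(string input)
    {
        return Regex.Replace(input, "</?.+?>", "");
    }
}
```

For `<Html> some stuff </HTML>`, `StripGreedy` returns an empty string because the whole input is one match, while `StripLazy` returns ` some stuff `.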

Michiel van Vaardegem
0

Using regex with HTML can be a bit dangerous: HTML is not a regular grammar, and a regex may fail in cases that are not easy to spot. If you are working with HTML in .NET, you may want to take a look at HTML Agility Pack.
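
One case where a simple tag-stripping regex goes wrong is a `>` inside an attribute value. The sketch below uses the pattern `<[^>]+>` from elsewhere in this thread; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of a markup shape the simple pattern mishandles.
public static class RegexPitfall
{
    // Naive tag removal: "<", one or more non-'>' characters, ">".
    public static string StripNaive(string html)
    {
        return Regex.Replace(html, "<[^>]+>", "");
    }
}
```

For the input `<a title="1 > 0">link</a>`, the first match ends at the `>` inside the attribute, so `StripNaive` returns ` 0">link` rather than `link`. A real HTML parser handles quoting rules like this; a regex only sees characters.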

curial