Regex - replace some HTML tags

Question

I would like to replace some empty HTML tags like <. /> (where . is b, h1, ... but not br, hr, ...).

I was thinking of Regex.Replace(myString, "<..? />", ""), but I don't know how to exclude br and hr.

Can anybody help me?

Thanks!

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Dan Puzey, Aug 10 '12 at 14:17
@DavidB Recognizing and replacing specific string patterns is not parsing. — Sean U, Aug 10 '12 at 14:19
@SeanU No, but parsing is done in the process of doing those steps... — Servy, Aug 10 '12 at 14:21
This isn't a duplicate, and doesn't necessarily require a parser - this looks like there is some auto-generated HTML that is creating empty tags. Regex is a great solution for this simple problem. Before you hit `close` or `-1`, _read the blog post in the Community Bulletin on the right_. — cjk, Aug 10 '12 at 14:24
@cjk Thinking that the question fits the criteria for deletion/closure is not being mean. — David B, Aug 10 '12 at 14:26
Some people automatically downvote when they see `HTML` and `Regex` in the same post. — mmdemirbas, Aug 10 '12 at 14:26
@DavidB but have you seen the words Regex and HTML together and automatically gone down the "you must parse it" route without thinking about this specific problem? This looks like it will be pretty simple... — cjk, Aug 10 '12 at 14:27
@cjk It's possible without a parser, sure, but I would much, much rather do it that way. Also David is right, the question as it stands has many flaws, the downvotes are justified. — Alexander R, Aug 10 '12 at 14:27
@AlexanderR if the question has fundamental flaws, then yes. If it has typographic problems, fix them. I understand the question and have posed an answer, therefore I see it as a valid question. — cjk, Aug 10 '12 at 14:29
@cjk I've removed my original comment because I now believe that this *could* be done with regex. But yes, as a rule of thumb, when people start saying HTML & regex together, I go into defensive mode. When you see tons of these posts, you can make a few snap judgements. — David B, Aug 10 '12 at 14:29
@DavidB I totally understand where you're coming from, the list of related questions on the right is full of situations where people see everything as their nail once they have Regex as their hammer. I guess I jumped on the responses to this one after reading the blog post... — cjk, Aug 10 '12 at 14:31
@cjk Regex will work in most situations, and with the simple examples and likely test cases. It won't be able to work globally with any given input, which will potentially result in bugs down the road. These headaches can be preemptively removed by not using Regex for HTML modifications. — Servy, Aug 10 '12 at 14:32
@Servy Yes, recognizing specific patterns is often an important part of lexing, which is in turn often the first step in many parsers. It's also a part that's usually done using regular expressions, because they're perfect for that task. It's true that you cannot parse HTML using regular expressions alone. . . but it's also true that a good HTML parser uses regular expressions to do the bit that parsing has in common with the task OP wants to accomplish. — Sean U, Aug 10 '12 at 14:36

score 3 · Answer 1 · answered Aug 10 '12 at 14:26


If you know which tags you want to remove, you could do it like this:

Regex.Replace(myString, "<(b|p|div|span) />", "")

Within the parentheses, the alternatives are separated by pipes.
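As a rough illustration of the pattern above, here is a minimal, self-contained sketch. The class and method names (`AllowListDemo`, `RemoveListedEmptyTags`) and the sample input are made up for this example; only the `Regex.Replace` call comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AllowListDemo
{
    // Hypothetical helper: removes only the self-closed tags named in the
    // alternation. The pattern expects exactly one space before "/>".
    public static string RemoveListedEmptyTags(string html)
    {
        return Regex.Replace(html, "<(b|p|div|span) />", "");
    }

    public static void Main()
    {
        string input = "<p>Hello<b /> world<br /><span /></p>";
        Console.WriteLine(RemoveListedEmptyTags(input));
        // prints: <p>Hello world<br /></p>
    }
}
```

Because only the listed names can match, br and hr are left alone without any explicit exclusion; the trade-off is that every tag to be removed has to be added to the list by hand.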

answered Aug 10 '12 at 14:26 by cjk

score 3 · Accepted Answer · edited Nov 13 '13 at 10:19


Try something like this:

(?:< *)(?!(?:br|hr)) *\w+ *\/ *\>

Add any other tags that you don't want to match to the br|hr part, separated by '|'.
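The exclusion idea can be sketched in C# as follows. This is a slightly stricter variant rather than the exact pattern above: it adds `\b` so that only the names br and hr themselves are excluded, and it does not allow spaces right after `<`. The class and method names (`ExcludeListDemo`, `RemoveEmptyTagsExcept`) and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExcludeListDemo
{
    // (?!(?:br|hr)\b) is a negative lookahead: it fails the match when the
    // tag name is exactly br or hr, and consumes no characters itself.
    private static readonly Regex EmptyTag =
        new Regex(@"<(?!(?:br|hr)\b)\w+\s*/>", RegexOptions.IgnoreCase);

    // Hypothetical helper: removes self-closed tags without attributes,
    // except the excluded names.
    public static string RemoveEmptyTagsExcept(string html)
    {
        return EmptyTag.Replace(html, string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveEmptyTagsExcept("<b /><br /><h1/><hr/>text"));
        // prints: <br /><hr/>text
    }
}
```

With this approach any new tag name is removed by default, so the list only needs to name the tags that must survive.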

answered Aug 10 '12 at 14:34 by Alexander Demyanenko
edited Nov 13 '13 at 10:19 by Epaga

This is called a zero-width negative lookahead. If you're curious, you can read about it here: http://msdn.microsoft.com/en-us/library/az24scfc.aspx – crlanglois Aug 10 '12 at 14:50
A slightly simpler version that works in your case: <(?!br|hr)(\w)+/> – crlanglois Aug 10 '12 at 14:52

score 1 · Answer 3 · answered Aug 10 '12 at 14:28

Use a pattern like this to match and replace them:

<(TAG1|TAG2|TAG3|...)\s*/?>

where (TAG1|TAG2|TAG3|...) is all the tags you want to handle, separated by pipes. Be sure to also specify that the regular expression should be case-insensitive, since HTML tags are case-insensitive. For example, to recognize just the two you listed, you could create a regex like this:

var exp = new Regex(@"<(b|h1)\s*/?>", RegexOptions.IgnoreCase);

How it works:

The bit in parentheses just identifies the tags that it should handle.
\s* recognizes zero or more whitespace characters. (One of these isn't needed at the start of the regex, because the HTML standard doesn't allow whitespace before the tag name.)
/? optionally matches a '/'. (This is just to be flexible about handling HTML that doesn't use the / in empty tags, since the HTML spec didn't always require it.)

You can use it to remove tags like so:

var strippedText = exp.Replace(input, String.Empty);
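Putting the pieces together, a minimal runnable sketch might look like this. The class name `OptionalSlashDemo`, the `Strip` helper and the sample inputs are invented for the example. The pattern is written as a verbatim string (`@"..."`) so that `\s` reaches the regex engine instead of being treated as a C# string escape.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionalSlashDemo
{
    // Same pattern as in the answer: tag name, optional whitespace,
    // optional slash, closing bracket; matched case-insensitively.
    private static readonly Regex Exp =
        new Regex(@"<(b|h1)\s*/?>", RegexOptions.IgnoreCase);

    // Hypothetical helper wrapping the Replace call.
    public static string Strip(string input)
    {
        return Exp.Replace(input, String.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(Strip("one<B />two<h1/>three")); // prints: onetwothree
        Console.WriteLine(Strip("<b>bold</b>"));           // prints: bold</b>
    }
}
```

The second call shows a side effect worth keeping in mind: because the slash is optional, the pattern also matches a plain opening tag such as `<b>`, leaving an unmatched `</b>` behind. If only self-closed tags should be removed, making the slash mandatory (`/>` instead of `/?>`) is probably the safer choice.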
