any idea why my regex isn't working properly?

Question

I'm trying to match all of the following tags:

<br> </br> <br/> and <br />  <p> and </p>

in my code, but my current regex is matching

<b> and </b>

as well, and I would like them excluded. How would I go about this?

 private static string StripTagsRegex(string source)
 {
    return Regex.Replace(source, "<.?br?/?>|<.?p?/?>", string.Empty);
 }

always show code. You mentioned C#, but do not show any C# code. — abelenky, Feb 19 '14 at 21:03
Never ever ever try to parse HTML with RegEx : http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 use something like https://htmlagilitypack.codeplex.com/ — aybe, Feb 19 '14 at 21:05
What you wanna do is pretty unclear. Can you show hypothetical input and what you want the output to be? @Aybe: delete all `br` and `p` tags isn't really "parsing HTML" in my opinion — Robin, Feb 19 '14 at 21:07
Aybe's comment notwithstanding, if you want to parse html with regex, you should be very careful not to match closing brackets of non-immediate nature. so instead it should be something like (I haven't checked, so may be incorrect) `<\/?(?:br|p)[^>]*\/?>` Point being don't use `.` to skip over characters as you can skip over closing `>`. Instead search for anything other than `>`, that is `[^>]`. — LB2, Feb 19 '14 at 21:15
@Robin as HTML can be malformed at times, it's certainly a better idea to parse a tag and get its content rather than trying to strip out tags. By looking at the outstanding number of votes (4430) in the link I've sent, I think this is the correct approach. — aybe, Feb 19 '14 at 21:21
@Aybe: Yep, that question is pretty famous and its answer is suited. This is a different question though, as the OP doesn't want to catch what's in between tags: just remove some tags. Without even checking if they match. You should also look at the second answer on your link, 1293 votes: *Never ever* is way too drastic. One should be aware of the risks, yes. — Robin, Feb 19 '14 at 21:28

Sam I am says Reinstate Monica · Accepted Answer · 2014-02-19T21:13:35.150

2

Get rid of the `?` after your `br` and `p`, and change `.?` to `/?`.

return Regex.Replace(source, @"</?br/?>|</?p/?>", string.Empty);

A consequence of this is that it will also remove certain invalid tags such as `</p/>`. If that's a big deal, you can use 4 cases instead of 2.

http://rubular.com/r/CqkUQKCCuR
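To make the difference concrete, here is a minimal sketch comparing the pattern from the question, the pattern above, and a stricter variant in the spirit of the "4 cases" remark. The class name `TagStripDemo`, the sample input, and the exact `Strict` pattern are assumptions for illustration, not part of the original answer. `Strict` also tolerates whitespace before a self-closing slash, which the pattern above does not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Pattern from the question: ".?" plus the optional "r" and "p"
    // let it match <b> and </b> as well.
    public const string Original = "<.?br?/?>|<.?p?/?>";

    // Pattern from the answer above.
    public const string Fixed = @"</?br/?>|</?p/?>";

    // Hypothetical stricter variant (an assumption): one alternative per
    // allowed form, with optional whitespace before the self-closing slash.
    public const string Strict = @"<br\s*/?>|</br>|<p>|</p>";

    public static string Strip(string source, string pattern)
    {
        return Regex.Replace(source, pattern, string.Empty);
    }

    public static void Main()
    {
        string input = "<p>one<br/>two<br />three</p><b>bold</b></p/>";

        Console.WriteLine(Strip(input, Original)); // onetwo<br />threebold
        Console.WriteLine(Strip(input, Fixed));    // onetwo<br />three<b>bold</b>
        Console.WriteLine(Strip(input, Strict));   // onetwothree<b>bold</b></p/>
    }
}
```

Note that neither the question's pattern nor the answer's pattern removes `<br />` (with a space), so some whitespace handling is probably needed if that form appears in the input.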

edited Feb 19 '14 at 21:13

answered Feb 19 '14 at 21:07


Do you really have to escape `/` natively in C#? In rubular you have to because the regex delimiters *are* forward slashes. But it's usually not a special character in regex; you wouldn't need to escape it if you used `#` (`#regex_with_/#flag`), for example – Robin Feb 19 '14 at 21:10
Oh, not quite what I wanted, but I found the solution thanks to you. I had a `?` after the `p`, which was catching the bold tag – Protonblast Feb 19 '14 at 21:11
@Robin you know, you're right. I copied and pasted the OP's code exactly and it worked – Sam I am says Reinstate Monica Feb 19 '14 at 21:13

LB2 · Answer 2 · 2014-02-19T21:45:17.153

0

Reposting comment as answer at Robin's suggestion:

As others mentioned, you should use an HTML parser for HTML parsing. But if you want to parse HTML with regex, you should be very careful not to match closing brackets of a non-immediate nature. So instead it should be something like this (I haven't checked, so it may be incorrect): `</?(?:br|p)(\s|/)[^>]*>`. The point is: don't use `.` to skip over characters, as you can skip over the closing `>`. Instead, search for anything other than `>`, that is `[^>]`.
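As a rough sketch of the `[^>]` idea: the pattern above, as written, requires a whitespace character or a slash right after the tag name, so it would not match plain `<br>`, `<p>`, or `</p>`. The variant below swaps that group for a `\b` word boundary, as suggested in the comments. The class name `AttributeTolerantStrip`, the sample input, and the choice of `RegexOptions.IgnoreCase` are assumptions for illustration. It is still not an HTML parser: an attribute value containing `>` would cut the match short.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeTolerantStrip
{
    // Hypothetical variant: \b requires the tag name to end right after
    // "br" or "p", so <b> and <pre> are left alone. [^>]* then skips
    // attributes, whitespace, or a trailing slash without running past
    // the closing ">".
    private static readonly Regex BrOrP =
        new Regex(@"</?(?:br|p)\b[^>]*>", RegexOptions.IgnoreCase);

    public static string Strip(string source) => BrOrP.Replace(source, string.Empty);

    public static void Main()
    {
        Console.WriteLine(Strip("<p class=\"x\">a<br />b</p><b>c</b><pre>d</pre>"));
        // prints: ab<b>c</b><pre>d</pre>
    }
}
```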

edited Feb 19 '14 at 21:45

answered Feb 19 '14 at 21:21


Actually, that would match `` as well. My bad :/ The `[^>]` is good general advice, but here no wildcard or `*` quantifier is necessary. – Robin Feb 19 '14 at 21:24
Actually, the `<p>` tag may have style attributes (`br` as well, I believe), so it is needed, but you're right, it will match... let me see how I can fix it (again without testing, as I don't have access to C# right now) – LB2 Feb 19 '14 at 21:26
There, I think a strategically positioned `\b` will do the trick hopefully... (and it goes to show why you shouldn't use regex to parse html :) ) – LB2 Feb 19 '14 at 21:28
Yep, but OP didn't ask to parse HTML nor to catch attributes: just match 6 specific strings that happen to be HTML tags. If the use is limited, it can be very much enough. Also FYI if you want to test regex online, you can use various tools such as http://regexpal.com/ (simple stuff) or http://regex101.com/ (complicated stuff) – Robin Feb 19 '14 at 21:30
Also **1**. `` would match, and **2**. the last `/?` is irrelevant, possible forward slash would be matched by `[^>]*`. Confirming that indeed, HTML can't be parsed with regex :/ – Robin Feb 19 '14 at 21:37
OK, last attempt to save my face. I think that should reasonably do it: `</?(?:br|p)(\s|/)[^>]*>` – LB2 Feb 19 '14 at 21:46
**1**. Keep in mind that `(?:\s|/)` is also `[\s/]` **2**. That would match `
` and weird stuff like `
`... Have you noticed you're trying to parse HTML with regex? :) Regex is only useful for OP if he uses them for a narrowly defined case. Adding support for whitespaces is probably the most you can safely do. – Robin Feb 19 '14 at 22:05
Right, and as I stated in the answer, one should use html parser for html parsing. And idea here is not to catch really screwed up input like
which is not a valid html tag, nor
> (which is also invalid). Idea is that if OP insists on RegEx use, then at least this will provide reasonable parsing that wouldn't trip on other valid tags, but is not intended to validate full html. So at this point I'd say within its limits, it's a valid workaround for cases where html parser cannot be used. – LB2 Feb 19 '14 at 22:10
