C# regular expressions with HTML strings

Question

I'm working on a small assignment that requires the use of regular expressions with HTML strings. My current problem is properly obtaining strings enclosed within HTML tags.

For instance:

I have a string

<p>&lt;Placeholder&gt;</p>

I've been able to obtain the contents with the following regex

private string Unescape(){
    string s = WebUtility.HtmlDecode("<p>&lt;Placeholder&gt;</p>");
    string dec = Regex.Replace(s, "^<.*?>|^<.*?><.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}

Which would return:

<Placeholder>

However, should the string contain an additional HTML tag, e.g.:

<p><strong>Placeholder</strong></p>

I would get this

<strong>Placeholder

It appears I'm only able to successfully remove the closing tag(s), but I can't do the same with the opening tag(s). Could anybody tell me where I've gone wrong?

EDIT:

To summarize, is there a way for me to treat the string enclosed within the HTML tags as a literal, so that it may safely contain special characters (e.g. > and <)?

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Habib, Oct 09 '12 at 06:59
Try using HtmlAgilityPack [http://htmlagilitypack.codeplex.com/](http://htmlagilitypack.codeplex.com/) — animaonline, Oct 09 '12 at 06:59
I'm trying to avoid libraries if possible. But I'll check it out! — Winz, Oct 09 '12 at 07:07
In my own project, I included the necessary source code, there's about 15 files. So it's pretty compact ;) I believe regex is an overkill solution. Good luck anyway! — animaonline, Oct 09 '12 at 07:09
"I'm trying to avoid libraries if possible." Reinventing the wheel is not good. You should *use* libraries if possible, unless you have a really good reason not to. — dan1111, Oct 09 '12 at 07:29
@dan1111 not trying to reinvent anything. Just trying to gain a better understanding of how regexes would work if I'm going to handle XMLs or HTMLs :) — Winz, Oct 09 '12 at 08:14
@Winz, if you are doing this as a learning exercize, then fair enough. — dan1111, Oct 09 '12 at 08:18

score 1 · Accepted Answer · answered Oct 09 '12 at 07:57

I am not sure you will be happy using regex on HTML, but I want to explain what causes your "mis"match:

An alternation takes the first alternative that matches and does not try the remaining ones. So when you search at the start for

^<.*?>|^<.*?><.*?>

on the string

<p><strong>Placeholder</strong></p>

the first alternative matches <p>, so the regex finishes with a successful match there and the second alternative is never tried. If you want to match <p><strong> at the start, you should change the order of the alternatives. This is only needed for the pattern at the start of the string. At the end of the string your ordering is fine, because the trailing $ forces the lazy .*? to keep expanding until the match reaches the end of the string.

So for your example this would work:

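private string Unescape(){
    string s = WebUtility.HtmlDecode("<p><strong>Placeholder</strong></p>");
    // Longer alternative first, so both opening tags are removed.
    // Caution: for "<p>&lt;Placeholder&gt;</p>" the decoded content looks like a tag
    // and this ordering would strip it as well.
    string dec = Regex.Replace(s, "^<.*?><.*?>|^<.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}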

==> The ordering inside an alternation can be important
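
To see the effect in isolation, here is a minimal sketch that only reports what each ordering matches at the start of the string. The class AlternationOrderDemo, its two constants, and the helper LeadingMatch are hypothetical names introduced for illustration; the patterns themselves are the ones discussed above.

```csharp
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // The question's ordering: the shorter alternative comes first.
    public const string ShortFirst = "^<.*?>|^<.*?><.*?>";

    // The reordered pattern: the longer alternative comes first.
    public const string LongFirst = "^<.*?><.*?>|^<.*?>";

    // Hypothetical helper: returns whatever the pattern matches in the input
    // (an empty string if there is no match).
    public static string LeadingMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }
}
```

For the input <p><strong>Placeholder</strong></p>, LeadingMatch with ShortFirst returns <p>, while LeadingMatch with LongFirst returns <p><strong>.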

An alternative would be to use a quantifier instead of an alternation:

string dec = Regex.Replace(s, "^(?:<.*?>)+", "");
return Regex.Replace(dec, "(?:</.*?>)+$", "");

This also works for more than two tags.
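
The quantifier version can be wrapped into a complete method. The sketch below is not from the original post: it assumes the input is a single fragment wrapped in simple tags, and it strips the tags before calling WebUtility.HtmlDecode, so that escaped content such as &lt;Placeholder&gt; is still inert text when the patterns run. TagStripper and InnerText are hypothetical names.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper, not part of the original answer.
    // Assumes a single fragment wrapped in simple, well-formed tags.
    public static string InnerText(string html)
    {
        // 1. Remove the run of tags at the start while the content is still
        //    escaped, so "&lt;Placeholder&gt;" cannot be mistaken for a tag.
        string stripped = Regex.Replace(html, "^(?:<.*?>)+", "");

        // 2. Remove the run of closing tags at the end.
        stripped = Regex.Replace(stripped, "(?:</.*?>)+$", "");

        // 3. Decode the entities only after the wrapping tags are gone.
        return WebUtility.HtmlDecode(stripped);
    }
}
```

With this ordering of steps, InnerText("<p>&lt;Placeholder&gt;</p>") returns <Placeholder> and InnerText("<p><strong>Placeholder</strong></p>") returns Placeholder. It is still only suitable for simple fragments like these; tags inside the content itself would need a real HTML parser.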
