Currently I'm eliminating certain formatting tags from HTML strings, and would like to learn enough about regular expressions to be able to replace any formatting. For starters, this is what I've done, but I'd like it to work with any font size, family, etc:

            p.body = p.body.Replace("line-height: 14px;", "");
            p.body = p.body.Replace("font-family: Arial, Helvetica, sans;", "");
            p.body = p.body.Replace("font-size: 11px;", "");

I'm actually not sure if regex can be used directly inside C# replace or not.

Dave
  • Why are you eliminating formatting from HTML? If this is to sanitize user input, then you'd be better off with a whitelist. – Thom Smith Oct 16 '12 at 13:17
  • Regarding your last sentence: it's [not hard to find out](http://msdn.microsoft.com/en-us/library/fk49wtc1.aspx). (And the answer is "no.") – Dan Puzey Oct 16 '12 at 13:21
  • Sorry, it was worth the downvote. Regex looks like garbledy gook to me. What took you guys 2 minutes would have taken me all day. – Dave Oct 16 '12 at 13:26
  • Thom, I have a Content Editable div. They produce formatting differently depending on what browser is being used. I need to get rid of all that formatting if possible. – Dave Oct 16 '12 at 13:36

4 Answers

Helper function

    public static string RemoveStyle(string html, string style)
    {
        // [^;\"]* consumes the value up to the next ';' or '"'. A lazy ".*?"
        // followed by an optional ";?" would match nothing after the colon.
        Regex regex = new Regex(Regex.Escape(style) + "\\s*:[^;\"]*;?\\s*");

        return regex.Replace(html, string.Empty);
    }

Usage:

    string input = "color: red ; line-height: 10px  ; font-family: Arial, Helvetica, sans;  ";
    input = RemoveStyle(input, "line-height");
    input = RemoveStyle(input, "font-family");

    // now, input = "color: red ; "
Kache
Neverever

To use regular expressions in C#, you'll need the Regex class (in the System.Text.RegularExpressions namespace).

To match only the specific types of styles you provided, I would try to match:

"line-height\\s*:[^;\"]*;?"
"font-family\\s*:[^;\"]*;?"
"font-size\\s*:[^;\"]*;?"

or, all together:

Regex.Replace(htmlString, "(line-height|font-family|font-size)\\s*:[^;\"]*;?", String.Empty);
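
The alternation can also be built from a list of property names. Here is a minimal sketch, assuming values never contain `;` or `"` (which holds for simple inline styles). The names `StyleCleaner` and `RemoveProperties` are made up for illustration and are not part of any library.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class StyleCleaner
{
    // Removes every declaration of the given CSS properties from a string.
    // Assumption: property values contain no ';' and no '"'.
    public static string RemoveProperties(string html, params string[] properties)
    {
        // Escape each name and join them into one alternation: a|b|c
        string names = string.Join("|", properties.Select(p => Regex.Escape(p)));

        // (?<![\w-]) keeps "height" from matching inside "line-height".
        string pattern = @"(?<![\w-])(?:" + names + @")\s*:[^;""]*;?\s*";

        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

Calling `StyleCleaner.RemoveProperties(html, "line-height", "font-family", "font-size")` on a span that carries only those three declarations should leave an empty `style=""` attribute behind, which is harmless but can be stripped separately if it matters.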
Kache

Regex.Replace - MSDN

You can strip the entire style attribute, perhaps like so:

Console.Write(Regex.Replace("<td style=\"text-align: right; vertical-align: bottom; width: 368px;\">", " style=\"[^\"]+\"", "")); // outputs "<td>"
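
Browsers and editors do not always emit the attribute in exactly that shape, so here is a slightly more tolerant sketch. It assumes the attribute may be written in any letter case, with single or double quotes, and with spaces around `=`. The class name `StyleAttributeStripper` is hypothetical. It is still a plain text match, so it would also remove something that merely looks like a style attribute inside text content.

```csharp
using System.Text.RegularExpressions;

public static class StyleAttributeStripper
{
    // \s+style\s*=\s*  the attribute name, with optional spaces around '='
    // (["'])           the opening quote, captured as group 1
    // .*?\1            the shortest run of text up to the same quote character
    private static readonly Regex StyleAttribute = new Regex(
        @"\s+style\s*=\s*([""']).*?\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the markup with every style attribute removed.
    public static string Strip(string html)
    {
        return StyleAttribute.Replace(html, string.Empty);
    }
}
```

Because the closing quote is matched with the backreference `\1`, a value such as `font-family: 'Times New Roman'` inside a double-quoted attribute is consumed whole instead of being cut at the first single quote.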
Simon Whitehead

Alright, let me start off by saying that what you're trying has become the new traveling salesman problem. But I wanted to reference this post, in which the answer below the accepted one states that you can in fact parse HTML with regular expressions; you just don't want to. Please read it because it will help you understand the hurdles.

Now, on to your specific problem.

Let's say you had some HTML like this:

<html>
<head>
</head>
<body>
    <span style="line-height: 14px; font-family: Arial, Helvetica, sans; font-size: 11px;">Some text in the span</span>
</body>
</html>

And you wanted to find and replace the line-height, you might write a RegEx like this:

line-height.+?;

And I think you can extrapolate the rest from that RegEx. However, the problem is that you're assuming a ; always ends the declaration, and with CSS I'm not sure you can assume that. That's why everybody tells you it can't be done with regular expressions. But follow along with me for a minute. Now, in C# you might write something like this (documented here):

var newString = Regex.Replace(htmlString, "(line-height:)(.+?)(;)", "$1 $3");

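The $1 and $3 will preserve the first and third captured groups (the `line-height:` prefix and the closing `;`), so only the value is dropped. Pass an empty replacement string instead if you want to remove the whole declaration.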
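
To make the semicolon hurdle concrete, here is a small sketch comparing the pattern above with a variant that treats the `;` as optional. The class and method names are made up for this illustration, and the variant assumes the value ends at the next `;` or double quote.

```csharp
using System.Text.RegularExpressions;

public static class SemicolonDemo
{
    // The pattern from the text: it needs a ';' somewhere after the name.
    public static string RemoveRequiringSemicolon(string html)
    {
        return Regex.Replace(html, "line-height.+?;", string.Empty);
    }

    // Variant: the value stops at ';' or at a double quote, and ';' is optional.
    public static string RemoveOptionalSemicolon(string html)
    {
        return Regex.Replace(html, "line-height\\s*:[^;\"]*;?", string.Empty);
    }
}
```

For the input `<span style="line-height: 14px">a; b</span>`, the first method runs past the closing quote to the `;` in the text and returns `<span style=" b</span>`, while the second returns `<span style="">a; b</span>`.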

Mike Perrenoud
  • Thanks Mike. So far, so good. I'll deal with any problems as they arise. – Dave Oct 16 '12 at 13:40
  • Good point about the "assuming `;`" thing, because I'm pretty sure you can't assume that, actually. For example, I think this is valid HTML: ``. – Kache Oct 16 '12 at 13:44
  • @Kache, in fact you're 100% right - I couldn't think of the example but I knew there was one out there! – Mike Perrenoud Oct 16 '12 at 13:50
  • Mike, I didn't bother mentioning that the styles I'm removing are added into contenteditable div's by the browser, and not typed in by the user. I can block the user from adding tags. I'm hoping (assuming) the browser is consistently adding semicolons after each style. Neverever got an answer in first. Otherwise your answer probably would have worked too. – Dave Oct 16 '12 at 15:01
  • @Dave, it's not a problem, but I wanted to ensure you understood the hurdles that exist with finding HTML using regular expressions, it's ***very*** complex and ***extremely*** fickle. – Mike Perrenoud Oct 16 '12 at 15:05