Get colored texts within HTML code

Question

I have some HTML and I want to convert it to plain text, but keep only the colored text tags. For example, given the HTML below:

<body>

This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample</p>
....
and some other tags...
</body>
</html>

I want the output:

this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...

@Ali.M - Satya means what are you using to even try to pull this off? Jquery? PHP? RegEx? — Anthony, Apr 18 '12 at 07:00
Also, you want to replace `</p>` by `<>`, but `</b>` by nothing? — Mr Lister, Apr 18 '12 at 07:13
Yes, I want to replace only the colors with a custom tag and remove all other HTML tags. I want to do it in C#. — Ali.M, Apr 18 '12 at 10:07

score 1 · Answer 1 · edited May 23 '17 at 11:52

It is possible to do it using regular expressions but... You should not parse (X)HTML with regex.

The first regexp I came up with to solve the problem is:

<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>

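Group 2 is the hex colour including the leading # (3 or 6 hexadecimal digits), group 3 is the digits alone, and group 4 matches the text inside the tag. Since group 4 is a repeated single-character group, its value is only the last character matched; wrap the repetition, as in ((?:\w|\s)+), to capture the whole text.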

This is not the best solution: I'm not a regexp master, and it needs testing and probably generalisation. Still, it's a good starting point.
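
For reference, a minimal C# sketch of how this regex idea might be applied to produce the `<#ff9999>...<>` form from the question. The pattern is a loosened variant, not the one above: the colour lands in group 1 and the text in group 2, and it only handles `<p>` tags without nested markup. It does not strip other tags such as `<b>`. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColoredParagraphs
{
    // Group 1: the hex colour (3 or 6 digits, with '#'); group 2: the tag's text.
    // [^>]* skips other attributes; (?<![\w-]) rejects "background-color";
    // [^<]* stops at the first nested tag, so nested markup is not matched.
    private static readonly Regex Pattern = new Regex(
        @"<p\b[^>]*(?<![\w-])color:\s*(#(?:[0-9a-f]{6}|[0-9a-f]{3}))\b[^>]*>([^<]*)</p>",
        RegexOptions.IgnoreCase);

    // Rewrites each matching paragraph as <#colour>text<>.
    public static string Rewrite(string html)
    {
        return Pattern.Replace(
            html,
            m => "<" + m.Groups[1].Value + ">" + m.Groups[2].Value + "<>");
    }
}
```

Calling `ColoredParagraphs.Rewrite` on the question's paragraph (with a proper `</p>` close) should give `<#ff9999>this is only a sample<>`; input that does not match is returned unchanged.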

answered Apr 18 '12 at 08:27 · Michał Miszczyszyn · edited by Community

Thanks, but it won't work. I am still working with regex to find a solution. – Ali.M Apr 18 '12 at 10:05

score 1 · Accepted Answer · answered Apr 18 '12 at 14:02

I'd use an HTML parser such as HtmlAgilityPack to parse the HTML, and regular expressions to find the color value in the style attributes.

First, use XPath to find all the nodes whose style attribute mentions a color:

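// Requires: using System.Linq; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null when nothing matches, so fall back to an empty array.
var nodes = doc.DocumentNode
    .SelectNodes("//*[contains(@style, 'color')]")
    ?.ToArray() ?? new HtmlNode[0];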

Next, the simplest regex that matches a color value is (?<=color:\s*)#?\w+:

var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
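
One caveat with this lookbehind: it also fires on `background-color:`, and for an `rgb(...)` value it captures only `rgb`. As a hedged alternative, the sketch below only accepts a `color` property that is not preceded by a hyphen or word character, and captures either a hex value or a plain colour name. The class name, method name, and pattern are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StyleColor
{
    // (?<![\w-]) rejects "background-color" and other prefixed properties.
    // Group 1 is a 3- or 6-digit hex colour, or a bare colour name such as "red".
    // (?!\() rejects functional values like rgb(...), which this sketch does not handle.
    private static readonly Regex ColorRegex = new Regex(
        @"(?<![\w-])color\s*:\s*(#(?:[0-9a-f]{6}|[0-9a-f]{3})\b|[a-z]+\b(?!\())",
        RegexOptions.IgnoreCase);

    // Returns the colour value from a style attribute, or null if none is found.
    public static string Extract(string style)
    {
        var match = ColorRegex.Match(style);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

With this, `StyleColor.Extract("background-color: #fff; color: red")` yields `red`, while a style containing only `background-color` yields null.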

Then iterate through these nodes and, if the regex matches, wrap the node's inner HTML in HTML-encoded tags (you'll understand why a little later):

// HttpUtility lives in System.Web.
foreach (var node in nodes)
{
    var style = node.Attributes["style"].Value;
    // Run the regex once and reuse the match.
    var match = colorRegex.Match(style);
    if (match.Success)
    {
        var color = match.Value;
        node.InnerHtml =
            HttpUtility.HtmlEncode("<" + color + ">") +
            node.InnerHtml +
            HttpUtility.HtmlEncode("</" + color + ">");
    }
}

Finally, get the inner text of the document and HTML-decode it (the encoding was needed because InnerText strips all real tags):

var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

This should return something like this:

This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...

Of course, you can adapt it to your needs.
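
On the encode/decode step: since reading the inner text strips every real tag, the colour markers have to travel through the document as entities and be turned back into `<...>` only at the very end. A small sketch of that round trip follows. It uses System.Net.WebUtility as a stand-in for HttpUtility (an assumption made here so the snippet does not depend on System.Web), and the class and method names are hypothetical.

```csharp
using System;
using System.Net;

public static class ColorMarkers
{
    // Wraps inner HTML in entity-encoded markers, e.g. &lt;#ff9999&gt;...&lt;/#ff9999&gt;,
    // so an HTML parser treats the markers as text rather than as tags.
    public static string Wrap(string color, string innerHtml)
    {
        return WebUtility.HtmlEncode("<" + color + ">")
            + innerHtml
            + WebUtility.HtmlEncode("</" + color + ">");
    }

    // Turns the entities back into literal markers once the tags are gone.
    public static string Unwrap(string innerText)
    {
        return WebUtility.HtmlDecode(innerText);
    }
}
```

Here `Wrap` stands in for the assignment to `node.InnerHtml` in the loop, and `Unwrap` for the final decode of the document's inner text.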
