0

I have some HTML and I want to convert it to plain text, keeping only the color information from colored text tags. For example, given the HTML below:

<body>

This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample</p>
....
and some other tags...
</body>
</html>

I want the output:

this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...
Oleks
Ali.M

2 Answers

1

It is possible to do this with regular expressions, but in general you should not parse (X)HTML with regex.

The first regexp I came up with to solve the problem is:

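<p(?:\w|\s|[="])+color:(#(?:[0-9a-f]{6}|[0-9a-f]{3}))">((?:\w|\s)+)</p>

Group 1 will be the colour (a # followed by 3 or 6 hexadecimal digits) and group 2 will be the text inside the tag. The repeated parts are non-capturing groups, because a repeated capturing group such as (\w|\s)+ would keep only its last character.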

It's not the best solution: I'm not a regexp master, it handles only <p> tags whose content is plain words and whitespace, and it needs testing and probably generalisation. Still, it's a good starting point.
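To make the group numbering concrete, here is a minimal C# sketch. It assumes a form of the pattern in which only the colour and the whole inner text are captured (groups 1 and 2). The class and method names are made up for illustration, and the `<>` closing marker follows the output requested in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColoredParagraphs
{
    // Only two capturing groups: 1 = colour (with '#'), 2 = whole inner text.
    // The repeated alternations are non-capturing so they do not shift the numbering.
    private static readonly Regex Pattern = new Regex(
        @"<p(?:\w|\s|[=""])+color:(#(?:[0-9a-f]{6}|[0-9a-f]{3}))"">((?:\w|\s)+)</p>",
        RegexOptions.IgnoreCase);

    // Rewrites every matching <p ...>text</p> as <#colour>text<>.
    public static string Convert(string html)
    {
        return Pattern.Replace(
            html,
            m => "<" + m.Groups[1].Value + ">" + m.Groups[2].Value + "<>");
    }

    public static void Main()
    {
        var html = "<p align=\"center\" style=\"color:#ff9999\">this is only a sample</p>";
        Console.WriteLine(Convert(html)); // <#ff9999>this is only a sample<>
    }
}
```

Paragraphs whose text contains punctuation or nested tags are left untouched by this pattern, which is one reason a parser-based approach tends to be more robust.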

Community
Michał Miszczyszyn
1

I'd use an HTML parser such as HtmlAgilityPack to parse the HTML, and a regular expression to find the color value in the attributes.

First, use XPath to find all the nodes that have a style attribute with a color defined in it:

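using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null when nothing matches, so fall back to an empty array
var nodes = doc.DocumentNode
    .SelectNodes("//*[contains(@style, 'color')]")
    ?.ToArray() ?? new HtmlNode[0];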

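Then, the simplest regex to match a color value is (?<=color:\s*)#?\w+. Note that it also matches inside background-color, and for a value such as rgb(255, 0, 0) it captures only rgb.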

var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
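As a quick check of what this regex picks up, the following sketch runs it against a few style strings. The sample strings and the `FirstColor` helper are invented for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorRegexDemo
{
    // Same pattern as above; .NET allows the variable-length lookbehind.
    private static readonly Regex ColorRegex =
        new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);

    // Returns the first colour-like token in a style string, or null if there is none.
    public static string FirstColor(string style)
    {
        var match = ColorRegex.Match(style);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FirstColor("color:#ff9999"));                     // #ff9999
        Console.WriteLine(FirstColor("font-size:12px; COLOR: red"));        // red
        Console.WriteLine(FirstColor("background-color:#000"));             // #000
        Console.WriteLine(FirstColor("font-weight:bold") ?? "(no match)");  // (no match)
    }
}
```

The third case shows that the lookbehind only requires the text `color:` immediately before the value, so properties whose names merely end in `color` are matched as well.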

Then iterate through these nodes and, if there is a regex match, wrap the inner HTML of the node in HTML-encoded marker tags (you'll understand why a little bit later):

foreach (var node in nodes)
{
    var style = node.Attributes["style"].Value;
    // Run the regex once and reuse the result
    var match = colorRegex.Match(style);
    if (match.Success)
    {
        var color = match.Value;
        // Encode the markers so they survive as text when tags are stripped
        node.InnerHtml =
            HttpUtility.HtmlEncode("<" + color + ">") +
            node.InnerHtml +
            HttpUtility.HtmlEncode("</" + color + ">");
    }
}

Finally, get the inner text of the document and HTML-decode it. InnerText strips all the tags, which is why the markers were encoded in the previous step: they are kept as text and become visible again after decoding.

var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

This should return something like this:

This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
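The encode-then-decode step can be seen in isolation in the sketch below. It uses `System.Net.WebUtility` instead of `HttpUtility` purely so that it runs without a `System.Web` reference; that substitution is an assumption of this example, as are the class and method names.

```csharp
using System;
using System.Net;

public static class MarkerRoundTrip
{
    // Encodes a colour marker so that an HTML parser sees text, not a tag.
    public static string EncodeMarker(string color)
    {
        return WebUtility.HtmlEncode("<" + color + ">");
    }

    public static void Main()
    {
        var encoded = EncodeMarker("#ff9999");
        Console.WriteLine(encoded);                        // &lt;#ff9999&gt;
        Console.WriteLine(WebUtility.HtmlDecode(encoded)); // <#ff9999>
    }
}
```

Because the encoded form contains no angle brackets, it stays in the text when the tags are stripped, and the final decode turns it back into the visible marker.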

Of course you could improve it for your needs.

Oleks