1

I am looking for a regular expression that removes all span tags while keeping their inner text. My inner HTML contains spans like the ones below.

Input

Properly formatted HTML

 <span style='font-size:10.0pt;font-family:"Arial","sans serif"'>
            First span
        </span>
        <span style="color:#221E1F;">
        <span style='font-size:10.0pt;font-family:"Arial";color:windowtext'>
        This is to test Regular expression
        </span>
        </span>
        <span style="color:#221E1F;"><span style='font-size:10.0pt;font-family:
                "Arial","sans-serif";color:#548DD4'>
        last Span  text
        </span>
        </span>

Not formatted properly:

 <span style='font-size:10.0pt;font-family:"Arial","sans-serif";
    mso-bidi-font-style:italic'>&lt;%T</span><span class="A1"><span style='font-size:
    10.0pt;font-family:"Arial","sans-serif";mso-fareast-font-family:Calibri;
    mso-fareast-theme-font:minor-latin;color:windowtext'>PA_Enrollment_Options%&gt;
    one of the convenient options below</span></span><span class="A1"><span style='font-size:10.0pt;font-family:"Arial","sans-serif";mso-fareast-font-family:
    Calibri;mso-fareast-theme-font:minor-latin;color:#548DD4;mso-themecolor:text2;
    mso-themetint:153'>: <o:p></o:p></span></span>

Expected Output : First Span This is to test Regular expression last span text

I have tried this regex: `(<span.*([\r\n]).*>)|(<span.*>)|(</span>)`

This works when my HTML is properly formatted, but in my case the HTML is not indented consistently.

I am not using regex to parse the whole document. I only apply this operation to the inner HTML.

Wiktor Stribiżew
Manjay_TBAG
  • 1
    Use `<span[^>]*>|</span>`. Or HtmlAgilityPack to do it in a more proper way. – Wiktor Stribiżew Jul 22 '15 at 08:09
  • 2
    Please don't [delete your question](http://stackoverflow.com/q/31556115/1324033) and reask it, update your original question... – Sayse Jul 22 '15 at 08:15
  • 1
    You have given a similar link in your previous question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Use an html parser. Don't use regex. – EZI Jul 22 '15 at 08:16

1 Answer

3

You can do it properly with HtmlAgilityPack:

public string getCleanHtml(string html)
{
    // Parse the fragment; HtmlAgilityPack tolerates malformed or oddly indented markup
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    // InnerText concatenates all text nodes and drops every tag; HTML entities (e.g. &lt;) are kept as-is.
    // To decode entities into literal characters, return this instead:
    // return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    return doc.DocumentNode.InnerText;
}

And then

var result = getCleanHtml(myInputHtml);

Here is the output:

[Image: output screenshot]

If you also need to get rid of extra whitespace, you can use a simple `String.Replace`, a `Regex.Replace`, or a split/join approach, depending on what you actually need.
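
As a rough sketch of the `Regex.Replace` and split/join options, something like the following could collapse the leftover line breaks and indentation into single spaces. The class and method names (`WhitespaceCleanup`, `CollapseWithRegex`, `CollapseWithSplit`) and the sample string are made up for illustration, and it assumes you want every run of whitespace reduced to one space.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class WhitespaceCleanup
{
    // Regex variant: replace each run of whitespace (spaces, tabs, line breaks)
    // with a single space, then trim the ends.
    public static string CollapseWithRegex(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Split/join variant: a null separator array splits on any whitespace character;
    // RemoveEmptyEntries drops the empty pieces between consecutive whitespace characters.
    public static string CollapseWithSplit(string text)
    {
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        // Assumed sample resembling the InnerText of the indented spans in the question.
        string innerText = "\n            First span\n        \n        This is to test Regular expression\n";
        Console.WriteLine(CollapseWithRegex(innerText));
        Console.WriteLine(CollapseWithSplit(innerText));
    }
}
```

Both variants should give the same result for input like this. The regex version is shorter, while the split/join version avoids the regex engine if that matters to you.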

Wiktor Stribiżew