42

I need to convert an HTML string to plain text (preferably using HTML Agility Pack), with proper whitespace and, especially, proper line-breaks.

And by "proper line-breaks" I mean that this code:

<div>
    <div>
        <div>
            line1
        </div>
    </div>
</div>
<div>line2</div>

Should be converted as

line1
line2

I.e. only one line-break.

Most of the solutions I've seen simply convert all <div> <br> <p> tags to \n which, obviously, s*cks.

Any suggestions for html-to-plaintext rendering logic in C#? I don't need complete code; even a general rule such as "replace all closing DIVs with line-breaks, but only if the next sibling is not a DIV too" would really help.

Things I tried: simply getting the .InnerText property (obviously wrong), regex (slow, painful, lots of hacks; regexes are also 12 times slower than HtmlAgilityPack, I measured it), this solution and similar ones (they return more line-breaks than required).

Alex from Jitbit
  • 2
    It should be possible to check the HtmlNode type (block or not) and do some intelligent layout... If you want to take the HtmlAgilityPack route. Apart from that, this BCL class may work: https://msdn.microsoft.com/en-us/library/windows/apps/windows.data.html.htmlutilities.converttotext.Aspx – jessehouwing May 01 '15 at 22:05
  • @jessehouwing Yes, that's what I was thinking. PS. I should have probably mentioned it's an ASP.NET MVC app (.NET 4), wouldn't want to use "metro apps" classes. – Alex from Jitbit May 01 '15 at 22:14
  • 1
    Your question as worded is off-topic for stackoverflow: "Any suggestion for a lightweight html-to-plaintext rendering engine for C#?" - SO is not for software recommendations. There's close votes against it, and there would be more if not for the open bounty. You should consider rewording your question. – antiduh May 06 '15 at 18:44
  • @antiduh thanks, doing that – Alex from Jitbit May 06 '15 at 21:28
  • 1
    In my opinion, HTML itself can't be converted correctly to plain text at all. At least, not without taking the CSS into account. – Zohar Peled May 06 '15 at 21:51

10 Answers

26

The code below works correctly with the example provided and even handles odd cases like <div><br></div>. There are still some things to improve, but the basic idea is there. See the comments.

public static string FormatLineBreaks(string html)
{
    //first - remove all the existing '\n' from HTML
    //they mean nothing in HTML, but break our logic
    html = html.Replace("\r", "").Replace("\n", " ");

    //now create an Html Agile Doc object
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //remove comments, head, style and script tags
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//comment() | //script | //style | //head"))
    {
        node.ParentNode.RemoveChild(node);
    }

    //now remove all "meaningless" inline elements like "span"
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//span | //label")) //add "b", "i" if required
    {
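        //unwrap the element: remove it but keep its children in place
        //(HtmlNode.CreateNode(node.InnerHtml) throws when the element holds more than one child node)
        node.ParentNode.RemoveChild(node, true);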
    }

    //block-elements - convert to line-breaks
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//p | //div")) //you could add more tags here
    {
        //we add a "\n" ONLY if the node contains some plain text as "direct" child
        //meaning - text is not nested inside children, but only one-level deep

        //use XPath to find direct "text" in element
        var txtNode = node.SelectSingleNode("text()");

        //no "direct" text - NOT ADDING the \n !!!!
        if (txtNode == null || txtNode.InnerHtml.Trim() == "") continue;

        //"surround" the node with line breaks
        node.ParentNode.InsertBefore(doc.CreateTextNode("\r\n"), node);
        node.ParentNode.InsertAfter(doc.CreateTextNode("\r\n"), node);
    }

    //todo: might need to replace multiple "\n\n" into one here, I'm still testing...

    //now BR tags - simply replace with "\n" and forget
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//br"))
        node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);

    //finally - return the text which will have our inserted line-breaks in it
    return doc.DocumentNode.InnerText.Trim();

    //todo - you should probably add "&code;" processing, to decode all the &nbsp; and such
}    

//here's the extension method I use
private static HtmlNodeCollection SafeSelectNodes(this HtmlNode node, string selector)
{
    return (node.SelectNodes(selector) ?? new HtmlNodeCollection(node));
}
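
The two todo notes in the code above (collapsing repeated line breaks and decoding entities) could be handled in one post-processing pass over the returned text. This is only a minimal sketch, assuming blank lines are never wanted in the output; `LineBreakCleanup` and `Normalize` are hypothetical names, not part of the answer or of HtmlAgilityPack.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: post-processes the text produced by FormatLineBreaks.
public static class LineBreakCleanup
{
    public static string Normalize(string text)
    {
        // Decode entities such as &amp; and &nbsp; (WebUtility lives in System.Net).
        text = WebUtility.HtmlDecode(text);

        // Work with "\n" only while processing.
        text = text.Replace("\r\n", "\n").Replace("\r", "\n");

        // Drop ordinary spaces and tabs around a line break
        // (non-breaking spaces are deliberately left alone here).
        text = Regex.Replace(text, @"[ \t]*\n[ \t]*", "\n");

        // Assumption: blank lines are never wanted, so runs of breaks collapse to one.
        text = Regex.Replace(text, @"\n{2,}", "\n");

        // Trim the ends and switch back to the platform line ending.
        return text.Trim().Replace("\n", Environment.NewLine);
    }
}
```

Calling `LineBreakCleanup.Normalize(FormatLineBreaks(html))` would apply it. Note that it also removes blank lines that were intentional (for example from two consecutive br tags), so it only suits input where that is acceptable.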
Serge Shultz
  • What is SafeSelectNodes? What library are you using? – Bas May 06 '15 at 20:49
  • This throws a `NullReferenceException` in the first `SelectNodes` for me when using the example in the question. – Bas May 06 '15 at 20:50
  • sorry, forgot to include the extension method I always use with HtmlAgilityPack (found it on SO by the way) – Serge Shultz May 06 '15 at 21:07
  • @SergeShultz I was just wondering if you found a better way to handle multiple "\n\n"? – Shyamal Parikh Mar 19 '17 at 17:54
  • Concerning the `//todo` after the return, `System.Web.HttpUtility` handles that on its own, just use `return HttpUtility.HtmlDecode(doc.DocumentNode.InnerText.Trim());` at the end. – Ziad Akiki Jun 02 '20 at 13:32
  • This approach threw "Multiple node elments can't be created." on one of the first test cases I tried with it, eg, S1S2 -- I'm posting an alternative as a separate answer. – pettys Nov 10 '22 at 22:44
15

Concerns:

  1. Non visible tags (script, style)
  2. Block-level tags
  3. Inline tags
  4. Br tag
  5. Wrappable spaces (leading, trailing and multi whitespaces)
  6. Hard spaces
  7. Entities

Algebraic solution:

  plain-text = Process(Plain(html))

  Plain(node-s) => Plain(node-0), Plain(node-1), ..., Plain(node-N)
  Plain(BR) => BR
  Plain(not-visible-element(child-s)) => nil
  Plain(block-element(child-s)) => BS, Plain(child-s), BE
  Plain(inline-element(child-s)) => Plain(child-s)   
  Plain(text) => ch-0, ch-1, .., ch-N

  Process(symbol-s) => Process(start-line, symbol-s)

  Process(start-line, BR, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(start-line, BS, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, BE, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, hard-space, symbol-s) => Print(' '), Process(not-ws, symbol-s)
  Process(start-line, space, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, common-symbol, symbol-s) => Print(common-symbol), 
                                                  Process(not-ws, symbol-s)

  Process(not-ws, BR|BS|BE, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(not-ws, hard-space, symbol-s) => Print(' '), Process(not-ws, symbol-s)
  Process(not-ws, space, symbol-s) => Process(ws, symbol-s)
  Process(not-ws, common-symbol, symbol-s) => Print(common-symbol), Process(not-ws, symbol-s)

  Process(ws, BR|BS|BE, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(ws, hard-space, symbol-s) => Print(' '), Print(' '), 
                                       Process(not-ws, symbol-s)
  Process(ws, space, symbol-s) => Process(ws, symbol-s)
  Process(ws, common-symbol, symbol-s) => Print(' '), Print(common-symbol),
                                          Process(not-ws, symbol-s)

C# solution for HtmlAgilityPack and System.Xml.Linq:

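  using System.Collections.Generic;
  using System.Text;
  using System.Xml.Linq;

  //extension methods need a static class (the name is arbitrary)
  public static class PlainTextExtensions
  {
  //HtmlAgilityPack part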
  public static string ToPlainText(this HtmlAgilityPack.HtmlDocument doc)
  {
    var builder = new System.Text.StringBuilder();
    var state = ToPlainTextState.StartLine;

    Plain(builder, ref state, new[]{doc.DocumentNode});
    return builder.ToString();
  }
  static void Plain(StringBuilder builder, ref ToPlainTextState state, IEnumerable<HtmlAgilityPack.HtmlNode> nodes)
  {
    foreach (var node in nodes)
    {
      if (node is HtmlAgilityPack.HtmlTextNode)
      {
        var text = (HtmlAgilityPack.HtmlTextNode)node;
        Process(builder, ref state, HtmlAgilityPack.HtmlEntity.DeEntitize(text.Text).ToCharArray());
      }
      else
      {
        var tag = node.Name.ToLower();

        if (tag == "br")
        {
          builder.AppendLine();
          state = ToPlainTextState.StartLine;
        }
        else if (NonVisibleTags.Contains(tag))
        {
        }
        else if (InlineTags.Contains(tag))
        {
          Plain(builder, ref state, node.ChildNodes);
        }
        else
        {
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
          Plain(builder, ref state, node.ChildNodes);
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
        }

      }

    }
  }

  //System.Xml.Linq part
  public static string ToPlainText(this IEnumerable<XNode> nodes)
  {
    var builder = new System.Text.StringBuilder();
    var state = ToPlainTextState.StartLine;

    Plain(builder, ref state, nodes);
    return builder.ToString();
  }
  static void Plain(StringBuilder builder, ref ToPlainTextState state, IEnumerable<XNode> nodes)
  {
    foreach (var node in nodes)
    {
      if (node is XElement)
      {
        var element = (XElement)node;
        var tag = element.Name.LocalName.ToLower();

        if (tag == "br")
        {
          builder.AppendLine();
          state = ToPlainTextState.StartLine;
        }
        else if (NonVisibleTags.Contains(tag))
        {
        }
        else if (InlineTags.Contains(tag))
        {
          Plain(builder, ref state, element.Nodes());
        }
        else
        {
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
          Plain(builder, ref state, element.Nodes());
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
        }

      }
      else if (node is XText)
      {
        var text = (XText)node;
        Process(builder, ref state, text.Value.ToCharArray());
      }
    }
  }
  //common part
  public static void Process(System.Text.StringBuilder builder, ref ToPlainTextState state, params char[] chars)
  {
    foreach (var ch in chars)
    {
      if (char.IsWhiteSpace(ch))
      {
        if (IsHardSpace(ch))
        {
          if (state == ToPlainTextState.WhiteSpace)
            builder.Append(' ');
          builder.Append(' ');
          state = ToPlainTextState.NotWhiteSpace;
        }
        else
        {
          if (state == ToPlainTextState.NotWhiteSpace)
            state = ToPlainTextState.WhiteSpace;
        }
      }
      else
      {
        if (state == ToPlainTextState.WhiteSpace)
          builder.Append(' ');
        builder.Append(ch);
        state = ToPlainTextState.NotWhiteSpace;
      }
    }
  }
  static bool IsHardSpace(char ch)
  {
    return ch == 0xA0 || ch ==  0x2007 || ch == 0x202F;
  }

  private static readonly HashSet<string> InlineTags = new HashSet<string>
  {
      //from https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elemente
      "b", "big", "i", "small", "tt", "abbr", "acronym", 
      "cite", "code", "dfn", "em", "kbd", "strong", "samp", 
      "var", "a", "bdo", "br", "img", "map", "object", "q", 
      "script", "span", "sub", "sup", "button", "input", "label", 
      "select", "textarea"
  };

  private static readonly HashSet<string> NonVisibleTags = new HashSet<string>
  {
      "script", "style"
  };

  public enum ToPlainTextState
  {
    StartLine = 0,
    NotWhiteSpace,
    WhiteSpace,
  }

}

Examples:

// <div>  1 </div>  2 <div> 3  </div>
1
2
3
//  <div>1  <br/><br/>&#160; <b> 2 </b> <div>   </div><div> </div>  &#160;3</div>
1

  2
 3
//  <span>1<style> text </style><i>2</i></span>3
123
//<div>
//    <div>
//        <div>
//            line1
//        </div>
//    </div>
//</div>
//<div>line2</div>
line1
line2
Serj-Tm
1

The class below provides an alternate implementation to innerText. It does not emit more than one newline for subsequent divs, because it only considers the tags that differentiate different text contents. Every text node's parent is evaluated to decide if a newline or space is to be inserted. Any tags that do not contain direct text are therefore automatically ignored.

The case you presented provided the same result as you desired. Furthermore:

<div>ABC<br>DEF<span>GHI</span></div>

gives

ABC
DEF GHI

while

<div>ABC<br>DEF<div>GHI</div></div>

gives

ABC
DEF
GHI

since div is a block tag. script and style elements are ignored completely. The HttpUtility.HtmlDecode utility method (in System.Web) is used to decode HTML escaped text like &amp;. Multiple occurrences of whitespace (\s+) are replaced by a single space. br tags will not cause multiple newlines if repeated.

static class HtmlTextProvider
{
    private static readonly HashSet<string> InlineElementNames = new HashSet<string>
    {
        //from https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elemente
        "b", "big", "i", "small", "tt", "abbr", "acronym", 
        "cite", "code", "dfn", "em", "kbd", "strong", "samp", 
        "var", "a", "bdo", "br", "img", "map", "object", "q", 
        "script", "span", "sub", "sup", "button", "input", "label", 
        "select", "textarea"
    }; 

    private static readonly Regex WhitespaceNormalizer = new Regex(@"(\s+)", RegexOptions.Compiled);

    private static readonly HashSet<string> ExcludedElementNames = new HashSet<string>
    {
        "script", "style"
    }; 

    public static string GetFormattedInnerText(this HtmlDocument document)
    {
        var textBuilder = new StringBuilder();
        var root = document.DocumentNode;
        foreach (var node in root.Descendants())
        {
            if (node is HtmlTextNode && !ExcludedElementNames.Contains(node.ParentNode.Name))
            {
                var text = HttpUtility.HtmlDecode(node.InnerText);
                text = WhitespaceNormalizer.Replace(text, " ").Trim();
                if(string.IsNullOrWhiteSpace(text)) continue;
                var whitespace = InlineElementNames.Contains(node.ParentNode.Name) ? " " : Environment.NewLine;
                //block-level text starts on its own line: swap a trailing space for a newline
                if (EndsWith(textBuilder, " ") && whitespace == Environment.NewLine)
                {
                    textBuilder.Remove(textBuilder.Length - 1, 1);
                    textBuilder.AppendLine();
                }
                textBuilder.Append(text);
                textBuilder.Append(whitespace);
                if (!char.IsWhiteSpace(textBuilder[textBuilder.Length - 1]))
                {
                    if (InlineElementNames.Contains(node.ParentNode.Name))
                    {
                        textBuilder.Append(' ');
                    }
                    else
                    {
                        textBuilder.AppendLine();
                    }
                }
            }
            else if (node.Name == "br" && EndsWith(textBuilder, Environment.NewLine))
            {
                textBuilder.AppendLine();
            }
        }
        return textBuilder.ToString().TrimEnd(Environment.NewLine.ToCharArray());
    }

    private static bool EndsWith(StringBuilder builder, string value)
    {
        return builder.Length > value.Length && builder.ToString(builder.Length - value.Length, value.Length) == value;
    }
}
Bas
  • Fails on `
    line1
    line2`
    – Alex from Jitbit May 07 '15 at 12:52
  • @jitbit It depends on the parent of the sample you provide. It fails on `
    line1
    line2
    `. It appears a newline should be forced both after and before, similar to the real logic for block elements. I changed the logic to accommodate this.
    – Bas May 07 '15 at 16:01
1

I don't believe SO is about exchanging bounties for writing complete code solutions. I think the best answers are those that give guidance and help you solve it yourself. In that spirit here's a process that occurs to me should work:

  1. Replace any lengths of whitespace characters with a single space (this is to represent the standard HTML whitespace processing rules)
  2. Replace all instances of </div> with newlines
  3. Collapse any multiple instances of newlines into a single newline
  4. Replace instances of </p>, <br> and <br/> with a newline
  5. Remove any remaining html open/close tags
  6. Expand any entities e.g. &trade; as required
  7. Trim the output to remove trailing and leading spaces

Basically, you want one newline for each paragraph or line break tag, but to collapse multiple div closures into a single one - so do those first.

Finally note that you are really performing HTML layout, and this depends on the CSS of the tags. The behaviour you see occurs because divs default to the block display/layout mode. CSS would change that. There is no easy way to a general solution for this problem without a headless layout/rendering engine, i.e. something that can process CSS.

But for your simple example case, the above approach should be sound.
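
The seven steps above can be sketched with plain regular expressions. This is only an illustration under stated assumptions, not a general solution: `NaiveTagStripper` is a hypothetical name, the tag patterns are deliberately simplistic, and the last step also trims spaces around every line (slightly more than step 7 asks for) so that indentation in the source does not leak into the output.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper following the numbered steps literally.
public static class NaiveTagStripper
{
    public static string Convert(string html)
    {
        // 1. Collapse every run of whitespace (including source newlines) to one space.
        string text = Regex.Replace(html, @"\s+", " ");
        // 2. Each closing </div> becomes a newline.
        text = Regex.Replace(text, @"</div\s*>", "\n", RegexOptions.IgnoreCase);
        // 3. Newlines separated only by spaces collapse into a single newline.
        text = Regex.Replace(text, @"\n(?: *\n)+", "\n");
        // 4. </p>, <br> and <br/> each become a newline.
        text = Regex.Replace(text, @"</p\s*>|<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        // 5. Remove all remaining opening and closing tags.
        text = Regex.Replace(text, @"<[^>]+>", "");
        // 6. Expand entities such as &trade; or &amp;.
        text = WebUtility.HtmlDecode(text);
        // 7. Trim spaces around each line break, then trim the whole output.
        text = Regex.Replace(text, @" *\n *", "\n");
        return text.Trim();
    }
}
```

For the nested-div example from the question this yields line1 and line2 separated by a single "\n". Text that is immediately followed by an opening div (as in line1<div>line2</div>) is not separated by these rules, and script or style content is not removed.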

Kieren Johnstone
  • opening `<div>` should also be replaced with newlines in some cases – Alex from Jitbit May 07 '15 at 08:15
  • Might I ask - if you know all of the above, and the rule you indicate, what's stopping you writing the code yourself? It doesn't look like there's much of a SO question left - if you're looking for a library or someone to write code, I'm not sure that's in the remit here? – Kieren Johnstone May 07 '15 at 12:15
  • I can't! :) I kept banging my head against this problem for too long, can't seem to come up with a simple logic/ruleset that will cover all the cases... Serge Shultz and Bas have come close enough though, great answers from both of them. The logic being "ignore html-nodes that do not have direct (un-nested) text in them". – Alex from Jitbit May 07 '15 at 13:19
1

03/2021 update of the top answer

The update adapts the code to current HtmlAgilityPack (built-in methods instead of nonexistent ones) and decodes HTML entities (e.g. &nbsp;).

public static string FormatLineBreaks(string html)
{
    //first - remove all the existing '\n' from HTML
    //they mean nothing in HTML, but break our logic
    html = html.Replace("\r", "").Replace("\n", " ");

    //now create an Html Agile Doc object
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //remove comments, head, style and script tags
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//comment() | //script | //style | //head") ?? Enumerable.Empty<HtmlNode>())
    {
        node.ParentNode.RemoveChild(node);
    }

    //now remove all "meaningless" inline elements like "span"
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span | //label") ?? Enumerable.Empty<HtmlNode>()) //add "b", "i" if required
    {
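        //unwrap the element: remove it but keep its children in place
        //(HtmlNode.CreateNode(node.InnerHtml) throws when the element holds more than one child node)
        node.ParentNode.RemoveChild(node, true);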
    }

    //block-elements - convert to line-breaks
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p | //div") ?? Enumerable.Empty<HtmlNode>()) //you could add more tags here
    {
        //we add a "\n" ONLY if the node contains some plain text as "direct" child
        //meaning - text is not nested inside children, but only one-level deep

        //use XPath to find direct "text" in element
        var txtNode = node.SelectSingleNode("text()");

        //no "direct" text - NOT ADDING the \n !!!!
        if (txtNode == null || txtNode.InnerHtml.Trim() == "") continue;

        //"surround" the node with line breaks
        node.ParentNode.InsertBefore(doc.CreateTextNode("\r\n"), node);
        node.ParentNode.InsertAfter(doc.CreateTextNode("\r\n"), node);
    }

    //todo: might need to replace multiple "\n\n" into one here, I'm still testing...

    //now BR tags - simply replace with "\n" and forget
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//br") ?? Enumerable.Empty<HtmlNode>())
        node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);

    //finally - return the text which will have our inserted line-breaks in it
    return WebUtility.HtmlDecode(doc.DocumentNode.InnerText.Trim());

    //todo - you should probably add "&code;" processing, to decode all the &nbsp; and such
}
Kristian
0

I don't know much about html-agility-pack but here is a c# alternative.

    public string GetPlainText()
    {
        WebRequest request = WebRequest.Create("URL for page you want to 'stringify'");
        WebResponse response = request.GetResponse();
        Stream data = response.GetResponseStream();
        string html = String.Empty;
        using (StreamReader sr = new StreamReader(data))
        {
            html = sr.ReadToEnd();
        }

        html = Regex.Replace(html, "<.*?>", "\n");

        html = Regex.Replace(html, @"\\r|\\n|\n|\r", @"$");
        html = Regex.Replace(html, @"\$ +", @"$");
        html = Regex.Replace(html, @"(\$)+", Environment.NewLine);

        return html;
    }

If you intend to show this in an HTML page, replace Environment.NewLine with <br/>.

ClassyBear
  • It will replace all tags (even the inline ones, like ``) with a line break. Also, multiple line-breaks are fine if they are intentional (like `


    `)
    – Alex from Jitbit May 02 '15 at 20:54
  • 5
    Nobody should [ever use regex to parse (x)html](http://stackoverflow.com/a/1732454/209259). – Erik Philips May 06 '15 at 17:58
  • @ErikPhilips oh come on! You're not seriously saying someone that just wants to do a simple Replace("
    ", @"\r\n"), HAS TO do it via HTML AP?
    – Fandango68 Nov 15 '16 at 06:58
0

The code below works for me:

static void Main(string[] args)
{
    StringBuilder sb = new StringBuilder();
    string html = new WebClient().DownloadString("https://www.google.com");
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var bodySegment = htmlDoc.DocumentNode.Descendants("body").FirstOrDefault();
    if (bodySegment != null)
    {
        foreach (var item in bodySegment.ChildNodes)
        {
            //skip top-level script elements and non-element nodes
            if (item.NodeType == HtmlNodeType.Element && string.Compare(item.Name, "script", true) != 0)
            {
                //blank out nested script and style content so it does not end up in InnerText
                foreach (var a in item.Descendants())
                {
                    if (string.Compare(a.Name, "script", true) == 0 || string.Compare(a.Name, "style", true) == 0)
                    {
                        a.InnerHtml = string.Empty;
                    }
                }
                //one output line per top-level element of the body
                sb.AppendLine(item.InnerText.Trim());
            }
        }
    }

    Console.WriteLine(sb.ToString());
    Console.Read();
}
Dreamweaver
  • Try with `htmlDoc.LoadHtml(new WebClient().DownloadString("https://www.google.com"))` ;) – Bas May 06 '15 at 18:57
  • Edited the code, but it seems the Google page has too much script inside each div, so the result is too confusing. I tried comparing the data and got confused, so I thought I'd get some help :) ... Please let me know how well the above code works. – Dreamweaver May 06 '15 at 19:58
  • With new WebClient().DownloadString("http://stackoverflow.com/"); the result looks a bit better and is easier to compare... Please validate. – Dreamweaver May 06 '15 at 20:01
  • One doubt I have: do we need each nested div's data on a separate line? – Dreamweaver May 06 '15 at 20:10
  • This will add New-line for every element, not just block-elements – Serge Shultz May 06 '15 at 21:14
  • Of course, `span` etc should not generate a line break – Alex from Jitbit May 07 '15 at 12:58
  • Just clear my understanding about the output :- 1) if html has
    te
    sttest1, then what output is expected ? 2) if html has
    testtest1
    , then what output is expected? 3) if a div has a table inside it, how should the table data be displayed? This will help in modifying the above code.
    – Dreamweaver May 07 '15 at 20:32
0

I always use CsQuery for my projects. It's supposedly faster than HtmlAgilityPack and much easier to use with css selectors instead of xpath.

var html = @"<div>
    <div>
        <div>
            line1
        </div>
    </div>
</div>
<div>line2</div>";

var lines = CQ.Create(html)
              .Text()
              .Replace("\r\n", "\n") // I like to do this before splitting on line breaks
              .Split('\n')
              .Select(s => s.Trim()) // Trim elements
              .Where(s => !string.IsNullOrWhiteSpace(s)) // Remove empty lines
              ;

var result = string.Join(Environment.NewLine, lines);

The above code works as expected; if you have a more complex example with an expected result, this code can easily be adapted.

If you want to preserve <br> for example, you can replace it with something like "---br---" in the html variable and split on it again in the final result.

imlokesh
  • Ahem... It does not work. Test with `
    line1
    line2`. Also - why remove empty lines?? What if they are intended? Like `

    ` Also, CsQuery is slower. See the chart here: https://github.com/FlorianRappl/AngleSharp/wiki/Performance
    – Alex from Jitbit May 07 '15 at 11:28
0

No-regex solution:

while (text.IndexOf("\n\n") > -1 || text.IndexOf("\n \n") > -1)
{
    text = text.Replace("\n\n", "\n");
    text = text.Replace("\n \n", "\n");
}

Regex:

text = Regex.Replace(text, @"^\s*$\n|\r", "", RegexOptions.Multiline).TrimEnd();

Also, as I remember,

text = HtmlAgilityPack.HtmlEntity.DeEntitize(text);

does the trick for decoding entities.

Raman Sinclair
0

The top answer didn't work for me. My contribution below should be fast and light, since it doesn't query the document: it recursively visits each node to find text nodes, using three bookkeeping flags to handle whitespace around inline and block elements.

using System;
using System.Text;
using HtmlAgilityPack;

public class HtmlToTextConverter {

    public static string Convert(string html) {
        var converter = new HtmlToTextConverter();
        converter.ParseAndVisit(html);
        return converter.ToString();
    }

    private readonly StringBuilder _text = new();
    private bool _atBlockStart = true;
    private bool _atBlockEnd = false;
    private bool _needsInlineWhitespace;

    public override string ToString() => _text.ToString();

    public void ParseAndVisit(string html) {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Visit(doc);
    }

    public void Visit(HtmlDocument doc) => Visit(doc.DocumentNode);

    public void Visit(HtmlNode node) {
        switch (node.NodeType) {
            case HtmlNodeType.Document:
                VisitChildren(node);
                break;

            case HtmlNodeType.Comment:
                break;

            case HtmlNodeType.Text:
                WriteText((node as HtmlTextNode).Text);
                break;

            case HtmlNodeType.Element:
                switch (node.Name) {
                    case "script":
                    case "style":
                    case "head":
                        break;

                    case "br":
                        _text.AppendLine();
                        _atBlockStart = true;
                        _atBlockEnd = false;
                        _needsInlineWhitespace = false;
                        break;

                    case "p":
                    case "div":
                        MarkBlockStart();
                        VisitChildren(node);
                        MarkBlockEnd();
                        break;

                    default:
                        VisitChildren(node);
                        break;
                }
                break;
        }
    }

    private void MarkBlockStart() {
        _atBlockEnd = false;
        _needsInlineWhitespace = false;
        if (!_atBlockStart) {
            _text.AppendLine();
            _atBlockStart = true;
        }
    }

    private void MarkBlockEnd() {
        _atBlockEnd = true;
        _needsInlineWhitespace = false;
        _atBlockStart = false;
    }

    private void WriteText(string text) {
        if (string.IsNullOrWhiteSpace(text)) {
            return;
        }

        if (_atBlockStart || _atBlockEnd) {
            text = text.TrimStart();
        }

        // This would mean this is the first text after a block end,
        // e.g., "...</p>this text"
        if (_atBlockEnd) {
            _text.AppendLine();
        }

        if (_needsInlineWhitespace) {
            _text.Append(" ");
        }

        var trimmedText = text.TrimEnd();
        if (trimmedText != text) {
            // This text has trailing whitespace; if more inline content
            // comes next, we'll need to add a space then; if a block start
            // or block end comes next, we should ignore it.
            _needsInlineWhitespace = true;
        } else {
            _needsInlineWhitespace = false;
        }

        _text.Append(trimmedText);
        _atBlockStart = false;
        _atBlockEnd = false;
    }

    private void VisitChildren(HtmlNode node) {
        if (node.ChildNodes != null) {
            foreach (var child in node.ChildNodes) {
                Visit(child);
            }
        }
    }

}

pettys