I use this method to convert HTML to plain text, but it has some bugs with heading tags (<h1>, <h2>, <h3>, ...).

Method:

public string HtmlToPlainText(string htmlText)
    {
        //const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
        const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
        const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
        var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
        var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
        //var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

        var text = htmlText;
        //Decode html specific characters
        text = System.Net.WebUtility.HtmlDecode(text);
        //Remove tag whitespace / line breaks
        //text = tagWhiteSpaceRegex.Replace(text, "><");
        //Replace < br /> with line breaks
        text = lineBreakRegex.Replace(text, Environment.NewLine);
        //Strip formatting
        text = stripFormattingRegex.Replace(text, string.Empty);
        return text;
    }

This is my HTML text:

<h3> This is a simple title </h3>
</br>
<p>Lorem ipsum <b> dolor sit </b> amet consectetur, <i>adipisicing elit.</i> </p>

This is my result:

This is a simple title Lorem ipsum dolor sit amet consectetur,
adipisicing elit.

The result should be:

This is a simple title

Lorem ipsum dolor sit amet consectetur, adipisicing elit.

I think the error comes from the strip-formatting step. How can I solve it?

  • You shouldn't use regex to extract data from HTML. – Rana Dec 25 '21 at 13:05
  • Did you mean `<br>` instead of `</br>`? – Andrew Morton Dec 25 '21 at 13:07
  • Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Rana Dec 25 '21 at 13:07
  • Why didn't you disclose that your question comes from a solution posted [here](https://stackoverflow.com/a/16407272/1248177)? – aloisdg Dec 25 '21 at 13:17
  • Does this answer your question? [How do you convert Html to plain text?](https://stackoverflow.com/questions/286813/how-do-you-convert-html-to-plain-text) – aloisdg Dec 25 '21 at 13:27
  • Please explain (with examples, test cases, etc.) how your question is not a duplicate of [this Q&A](https://stackoverflow.com/a/16407272/1248177). – aloisdg Dec 25 '21 at 13:28

1 Answer

Parsing HTML is not an easy task, even for a subset of HTML. Regex may feel like a good fit for it, but it is not. To parse HTML, you should use an HTML parser. In C#, AngleSharp and the Html Agility Pack are the most common choices. Here is an example with AngleSharp:

using System;
using AngleSharp;
using AngleSharp.Html.Parser;

class MyClass {
    static void Main() {
        //Use the default configuration for AngleSharp
        var config = Configuration.Default;

        //Create a new context for evaluating webpages with the given config
        var context = BrowsingContext.New(config);

        //Source to be parsed
        var source = @"<h3> This is a simple title </h3>
</br>
<p>Lorem ipsum <b> dolor sit </b> amet consectetur, <i>adipisicing elit.</i> </p>
";

        //Create a parser to specify the document to load (here from our fixed string)
        var parser = context.GetService<IHtmlParser>();
        var document = parser.ParseDocument(source);

        //Do something with document like the following
        Console.WriteLine(document.DocumentElement.TextContent);
    }
}
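
`TextContent` concatenates the text nodes as they appear in the source, so it does not insert a line break where a heading or paragraph ends. If you need that, one option is to walk the parsed tree yourself and append a newline after each block-level element. The following is only a minimal sketch: the `HtmlText` class, the `ToPlainText` method, and the `BlockTags` list are names and choices made up for this illustration, not part of AngleSharp, and the list of tags is an assumption you would adjust to your own markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

static class HtmlText
{
    // Assumption: these are the tags that should end a line. Extend as needed.
    static readonly HashSet<string> BlockTags = new HashSet<string>
    {
        "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br"
    };

    // Hypothetical helper: parse the HTML, walk the body, tidy the whitespace.
    public static string ToPlainText(string html)
    {
        var document = new HtmlParser().ParseDocument(html);
        var sb = new StringBuilder();
        Append(document.Body, sb);

        // Collapse repeated spaces and trim every line, then trim the whole text.
        var lines = sb.ToString()
            .Split('\n')
            .Select(line => Regex.Replace(line, " +", " ").Trim());
        return string.Join(Environment.NewLine, lines).Trim();
    }

    static void Append(INode node, StringBuilder sb)
    {
        foreach (var child in node.ChildNodes)
        {
            if (child.NodeType == NodeType.Text)
            {
                // Whitespace in the source (including line breaks) counts as a single space.
                sb.Append(Regex.Replace(child.TextContent, @"\s+", " "));
            }
            else if (child is IElement element)
            {
                Append(element, sb);
                // End the line after a block-level element.
                if (BlockTags.Contains(element.LocalName))
                    sb.Append('\n');
            }
        }
    }
}

class Program
{
    static void Main()
    {
        var source = @"<h3> This is a simple title </h3>
</br>
<p>Lorem ipsum <b> dolor sit </b> amet consectetur, <i>adipisicing elit.</i> </p>
";
        Console.WriteLine(HtmlText.ToPlainText(source));
    }
}
```

For the sample above this is intended to print the title and the paragraph on separate lines with a blank line between them, because an HTML5 parser treats the stray `</br>` as a `<br>` element. Note that the sketch also keeps the text of `<script>` and `<style>` elements, so you may want to skip those for real pages.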

aloisdg