2

I need to decode HTML into plain text. I know that there are a lot of questions like this but I noticed one problem with those solutions and don't know how to solve it.

For example we have this piece of HTML: <h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>

I tried regex solutions and the HttpUtility.HtmlDecode method, and all of them give this output: Some textSome more text. Words get joined where they should be separate. Is there a way to decode the string without merging words?

j1rjacob
PovilasZ
  • You can take a substring to take all strings after ">" and all strings before "<" – Adas Feb 08 '19 at 13:02
  • What would you want to use to separate the two phrases? What would determine when one phrase ends and the next begins? – Andy G Feb 08 '19 at 13:02
  • https://html-agility-pack.net/ will allow you to parse HTML pretty successfully and gain access to all parts of the HTML (including tags and inner text). – Neil Feb 08 '19 at 13:03
  • Space between words would work for me. Just want to make sure words don't get blended. – PovilasZ Feb 08 '19 at 13:03
  • Yeah... a simple [regex](https://stackoverflow.com/a/1732454/1336590) will do... – Corak Feb 08 '19 at 13:04
  • RegEx is not a good answer for this. Sure, you might find you can get it to work 99% of the time, but HTML is not XML. It's too irregular for regular expressions. – Neil Feb 08 '19 at 13:05
  • Depending on how intricate or simple the HTML might be, I suppose you could initially replace all `<br>` with spaces before extracting the plain text content. – Andy G Feb 08 '19 at 13:09
  • Html agility pack oneliner `string.Join("\n", htmlDoc.DocumentNode.ChildNodes.Select(x=> x.InnerText));` each node text will be on a line but you can Join on a simple space. – Drag and Drop Feb 08 '19 at 13:44

4 Answers

4

It's not clear what separator you want between things that were not separated in the first place, so I used a newline (\n).
The Where(x => !string.IsNullOrWhiteSpace(x)) filter removes the empty elements, which would otherwise produce a lot of \n\n in a more complex HTML document.

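// Requires the HtmlAgilityPack NuGet package.
using System.Linq;
using HtmlAgilityPack;

var input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);

// Join the text of each top-level node, skipping empty ones such as <p><br></p>.
var result = string.Join(
    "\n",
    htmlDocument.DocumentNode.ChildNodes
        .Select(x => x.InnerText)
        .Where(x => !string.IsNullOrWhiteSpace(x)));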

Result:

"Some text\nSome more text"

Drag and Drop
2

An easy way to do it is to use the HTML Agility Pack:

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlString); // LoadHtml parses an HTML string; Load expects a file path or stream
string res = htmlDocument.DocumentNode.SelectSingleNode("YOUR XPATH TO THE INTERESTING ELEMENT").InnerText;
Or Yaacov
  • This is giving the same result `Some textSome more text` while expected result is `Some text Some more text` – PovilasZ Feb 08 '19 at 13:40
  • @Sparrow So you should either 1. choose the HTML element that contains them both, or 2. choose each one of them and concatenate the strings, but that's not the elegant way to do it. – Or Yaacov Feb 08 '19 at 13:52
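To illustrate the second option from the comment above (select each piece of text and concatenate), here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (HtmlTextNodes, JoinTextNodes) are made up for this example. It joins every text node with a single space, so it will also put a space between inline fragments such as `Some <b>bold</b>text`, which may or may not be what you want.

```csharp
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class HtmlTextNodes
{
    // Hypothetical helper: joins all non-empty text nodes with a single space.
    public static string JoinTextNodes(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
        {
            return string.Empty;
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches the XPath.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
        {
            return string.Empty;
        }

        return string.Join(
            " ",
            textNodes
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // decode entities like &amp;
                .Where(t => t.Length > 0));                             // skip whitespace-only nodes
    }
}
```

With the HTML from the question, `HtmlTextNodes.JoinTextNodes(input)` should give `Some text Some more text`.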
0

You can use something like the following. In this sample I have used a new line to separate the inner text; hopefully you can adapt this to suit your scenario.

// Requires: using System.Text; using HtmlAgilityPack;
public static string GetPlainTextFromHTML(string inputText)
{
    // Extracted plain text
    var plainText = string.Empty;

    if (string.IsNullOrWhiteSpace(inputText))
    {
        return plainText;
    }

    var htmlNote = new HtmlDocument();
    htmlNote.LoadHtml(inputText);

    var nodes = htmlNote.DocumentNode.ChildNodes;
    if (nodes == null)
    {
        return plainText;
    }

    StringBuilder innerString = new StringBuilder();

    // Append the text of each top-level node, followed by a new line
    foreach (HtmlNode node in nodes)
    {
        innerString.Append(node.InnerText);
        innerString.Append("\n"); // "\\n" would append a literal backslash and 'n'
    }

    plainText = innerString.ToString();
    return plainText;
}
Hasitha
-1

You can use a regex: <(div|/div|br|p|/p)[^>]{0,}>
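The answer does not say how to apply the pattern, so the following is only a sketch of one plausible use: replace the matched tags with a space, strip whatever tags remain, decode entities, and collapse the whitespace. The class and method names (HtmlText, ToPlainText) are made up for this example, and it uses only the base class library. Keep in mind the caveat from the comments: regex is fragile on real-world HTML, and this pattern only knows about div, br and p (it would not separate two adjacent headings, and `<p` also matches tags such as `<pre>`).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper, not part of any library.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
        {
            return string.Empty;
        }

        // 1. Turn the tags matched by the pattern above into spaces.
        var spaced = Regex.Replace(
            html, @"<(div|/div|br|p|/p)[^>]{0,}>", " ", RegexOptions.IgnoreCase);

        // 2. Remove every remaining tag without adding a separator.
        var stripped = Regex.Replace(spaced, @"<[^>]+>", string.Empty);

        // 3. Decode entities such as &amp; and &nbsp;.
        var decoded = WebUtility.HtmlDecode(stripped);

        // 4. Collapse runs of whitespace into one space and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For the HTML in the question, `HtmlText.ToPlainText(input)` returns `Some text Some more text`.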