Getting text nodes of an HtmlDocument

Question

After the WebBrowser finishes loading, its document contains something like:

<div id="toextract">
    <div>This</div>
    <div>is</div>
    Sample
    <div>text</div>
    I
    <div>want to</div>
    <div>Extract</div>
</div>

I want to extract the InnerHtml of these elements so that the output is:

This is Sample text I want to Extract

But I get this:

This is text want to Extract

The words "I" and "Sample" are missing because they are not wrapped in an HtmlElement of their own. This is my code:

string Ex = "";
HtmlElement elem = webBrowser1.Document.GetElementById("toextract");
HtmlElementCollection elems = elem.All;
// elem.All holds elements only, so the loop never visits bare text nodes
for (int i = 0; i < elems.Count; i++)
    Ex += elems[i].InnerHtml + " ";

My code skips text nodes (nodes with no tag). I think that is because they are not considered HtmlElements. How can I include them in my extracted text?

You assume right, the HTML is not entirely valid. Do you expect violations other than missing tags (like non-closing tags: text, other text)? - flo scheiwiller, Mar 23 '14 at 20:41

flo scheiwiller · Accepted Answer · 2014-03-24T09:40:21.813

2

Simply fetch the text with

elem.InnerText

and replace any line breaks with spaces like this

elem.InnerText.Replace(System.Environment.NewLine, " ")
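
Replacing only System.Environment.NewLine can leave stray spaces or lone "\n" characters behind, depending on what InnerText actually returns. A minimal sketch of a more tolerant cleanup, assuming that collapsing every whitespace run into one space is acceptable; the class name, the CollapseWhitespace helper and the sample string are hypothetical and only illustrate the idea:

```csharp
using System;
using System.Text.RegularExpressions;

static class InnerTextCleanup
{
    // Collapses every run of whitespace (spaces, tabs, \r, \n) into a single
    // space and trims both ends. A null input is treated as an empty string.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text ?? "", @"\s+", " ").Trim();
    }

    static void Main()
    {
        // Hypothetical InnerText value; the real string depends on the browser engine.
        string innerText = "This\r\nis\r\nSample \r\ntext\r\nI \r\nwant to\r\nExtract";
        Console.WriteLine(CollapseWhitespace(innerText));
    }
}
```

In the WebBrowser case this would be called as CollapseWhitespace(elem.InnerText).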

edited Mar 24 '14 at 09:40

answered Mar 23 '14 at 21:15


score 0 · Answer 2 · edited May 23 '17 at 12:21

Try iterating over the child nodes instead of the elements, then strip the unneeded whitespace and line breaks. Something like this (not yet tested):

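// Assumes a project reference to Microsoft.mshtml, plus: using mshtml; using System.Text.RegularExpressions;
string Ex = "";
HtmlElement elem = webBrowser1.Document.GetElementById("toextract");
// HtmlElement has no childNodes member, so go through the underlying COM DOM node
var nodes = (IHTMLDOMChildrenCollection)((IHTMLDOMNode)elem.DomElement).childNodes;
for (int i = 0; i < nodes.length; i++)
{
    var node = (IHTMLDOMNode)nodes.item(i);
    if (node.nodeType == 3) Ex += node.nodeValue + " ";                      // text node
    else if (node.nodeType == 1) Ex += ((IHTMLElement)node).innerText + " "; // element node
}
Ex = Regex.Replace(Ex, @"\s+", " ").Trim(); // collapse line breaks and runs of spaces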

For similar Q&A, see best way to get child nodes and How to remove extra returns and spaces in a string by regex?
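
To see the difference between walking elements and walking child nodes without a WebBrowser control, here is a rough sketch that uses System.Xml's XmlDocument as a stand-in for the browser DOM. This is a swapped-in technique, not the WebBrowser API, and it only works because the sample markup happens to be well-formed XML. The class name and the Extract helper are hypothetical:

```csharp
using System;
using System.Text;
using System.Xml;

static class ChildNodeText
{
    // Joins the text of every direct child node, bare text nodes included.
    public static string Extract(XmlNode parent)
    {
        var sb = new StringBuilder();
        foreach (XmlNode node in parent.ChildNodes)
        {
            // Text nodes carry their content in Value; elements expose it via InnerText.
            string piece = node.NodeType == XmlNodeType.Text ? node.Value : node.InnerText;
            piece = piece.Trim();
            if (piece.Length > 0)
                sb.Append(piece).Append(' ');
        }
        return sb.ToString().TrimEnd();
    }

    static void Main()
    {
        var doc = new XmlDocument();
        // The markup from the question, written as one XML string.
        doc.LoadXml("<div id=\"toextract\"><div>This</div><div>is</div>\n    Sample\n    " +
                    "<div>text</div>\n    I\n    <div>want to</div><div>Extract</div></div>");
        Console.WriteLine(Extract(doc.DocumentElement));
    }
}
```

The loop visits "Sample" and "I" because ChildNodes includes text nodes, which an element-only collection such as elem.All does not.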
