After the WebBrowser document loads, it contains something like:

<div id="toextract">
    <div>This</div>
    <div>is</div>
    Sample
    <div>text</div>
    I
    <div>want to</div>
    <div>Extract</div>
</div>

I want to extract the InnerHtml of these elements so that the output would be:

This is Sample text I want to Extract

But I get this:

This is text want to Extract

because the words I and Sample are not inside an HtmlElement. This is my code:

string Ex = "";
HtmlElement elem = webBrowser1.Document.GetElementById("toextract");
HtmlElementCollection elems = elem.All;
for(int i=0;i<elems.Count;i++)
    Ex += elems[i].InnerHtml + " ";

My code skips text nodes (nodes with no tag). I think it's because they are not considered HtmlElements. How can I include them in my extracted text?

Ashkan Mobayen Khiabani

2 Answers

Simply fetch the text with

elem.InnerText

and replace any line breaks with spaces like this

elem.InnerText.Replace(System.Environment.NewLine, " ")
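
Depending on how the browser renders the nested divs, InnerText may still contain doubled spaces after the replacement. A minimal sketch of one way to tidy that up, assuming a hypothetical helper named CollapseWhitespace (it is not part of the WebBrowser API) and a hard-coded sample string standing in for elem.InnerText:

```csharp
using System;
using System.Text.RegularExpressions;

static class TextCleanup
{
    // Hypothetical helper: collapses every run of whitespace,
    // including \r\n line breaks, into a single space.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text ?? "", @"\s+", " ").Trim();
    }

    static void Main()
    {
        // Assumed stand-in for elem.InnerText; the real value depends on the browser.
        string sample = "This\r\nis\r\nSample \r\ntext\r\nI \r\nwant to\r\nExtract";
        Console.WriteLine(CollapseWhitespace(sample));
        // prints "This is Sample text I want to Extract"
    }
}
```

In the form itself this would be called as TextCleanup.CollapseWhitespace(elem.InnerText).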

flo scheiwiller

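Try iterating over the child nodes instead of the elements, and then strip away the unneeded whitespace and line breaks. HtmlElement itself does not expose child nodes, so this goes through DomElement and assumes a reference to the Microsoft HTML Object Library (mshtml). Something like this (not yet tested):

// Needs System.Text.RegularExpressions and a COM reference to mshtml (assumption).
string Ex = "";
HtmlElement elem = webBrowser1.Document.GetElementById("toextract");
mshtml.IHTMLDOMNode root = (mshtml.IHTMLDOMNode)elem.DomElement;
for (mshtml.IHTMLDOMNode node = root.firstChild; node != null; node = node.nextSibling)
{
    // nodeType 3 is a text node; any other child is treated as an element.
    var child = node as mshtml.IHTMLElement;
    Ex += (node.nodeType == 3 ? node.nodeValue as string : child?.innerText) + " ";
}
Ex = Regex.Replace(Ex, @"\s+", " ").Trim();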

For similar Q&A, see "best way to get child nodes" and "How to remove extra returns and spaces in a string by regex?"

Marcus