<div id="div1">
    <span>Span text 1</span>
    <span>Span text 2</span>
    Div Inner Text
</div>

How to extract only the div1 text (Div Inner Text)?

div1.innerText also returns the spans' text.

Stanislav Stoyanov
  • C# and webbrowser? don't you mean Javascript? Also, what have you tried? It may help understand the how and what you're trying to do. – Yanick Rochon Feb 07 '11 at 15:59
  • .NET WebBrowser component. I tried HtmlElement.InnerText, but it also returns the inner tags' text. I tried to remove all child elements, but the component has no such property or method. – Stanislav Stoyanov Feb 07 '11 at 16:03

2 Answers


There are similar questions regarding fetching an element's inner text.

  • Solution 1 : see this question

    HtmlElement e1 = webBrowser1.Document.GetElementById("elementId");
    string content = e1.InnerText;
    MessageBox.Show(content);
    
  • Solution 2 : use JavaScript via the HtmlDocument.InvokeScript method

    In your HTML :

    <script type="text/javascript">
        function getInnerText(id) {
           return document.getElementById(id).innerText;
        }
    </script>
    

    C#

    Object[] objArray = new Object[1];
    objArray[0] = (Object)"elementId";
    string content = (string)webBrowser1.Document.InvokeScript("getInnerText", objArray);
    MessageBox.Show(content);
    
Yanick Rochon
  • Solution 1: This will strip any HTML tags and return "Span text 1 Span text 2 Div Inner Text". I want only "Div Inner Text". Solution 2 is fine, but I cannot alter the source HTML. – Stanislav Stoyanov Feb 07 '11 at 17:11

The approach I would take is to iterate over the child nodes, test whether each one is a text node and, if it is, store its value in an array, then return the elements of the array concatenated.

  function innerText(element){
    var i, text = [], child = null;
    for(i = 0; i < element.childNodes.length; i++){
      child = element.childNodes[i];

      if (child.nodeType === 3 &&
        child.nodeValue.match(/[^\n\s\t\r]/)){
        text.push(child.nodeValue);
      }
    }
    return text.join("");
  }
  // Example call
  alert(innerText(document.getElementById("div1")));

The code above uses the nodeType property of each child node to check whether it is a text node (nodeType === 3), and the regular expression to check that the node contains more than whitespace. The result could be tidied by trimming leading and trailing whitespace.
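The trimming mentioned above could look something like the following. This is only a sketch: `ownText` is a made-up name, and the regular-expression trim is used on the assumption that the hosted browser engine may be too old to offer `String.prototype.trim`.

```javascript
// Sketch: return only the text that belongs directly to an element,
// with each piece trimmed. `ownText` is a hypothetical helper name.
function ownText(element) {
  var i, child, value, parts = [];
  for (i = 0; i < element.childNodes.length; i++) {
    child = element.childNodes[i];
    // nodeType 3 is a text node; element children such as <span> are type 1.
    if (child.nodeType === 3) {
      // Strip leading and trailing whitespace with a regular expression
      // (assumption: String.prototype.trim may be unavailable).
      value = child.nodeValue.replace(/^\s+|\s+$/g, "");
      // Skip nodes that held nothing but whitespace.
      if (value !== "") {
        parts.push(value);
      }
    }
  }
  // Join with a single space so separate text runs do not fuse together.
  return parts.join(" ");
}
```

For the markup in the question, the whitespace-only text nodes between the spans are dropped and only the trailing text node contributes to the result.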

Edit: C# use

Using Yanick's code as a template, as it seems straightforward, update the JavaScript function to:

  function innerText(id){

    var i, text = [], child = null, element = document.getElementById(id);
    for(i = 0; i < element.childNodes.length; i++){
      child = element.childNodes[i];

      if (child.nodeType === 3 &&
        child.nodeValue.match(/[^\n\s\t\r]/)){
        text.push(child.nodeValue);
      }
    }
    return text.join("");
  }

Then it can be called using:

string content = 
  (string)webBrowser1.Document.InvokeScript("innerText", 
                                            new string[] { "div1" });

The variable content will contain the inner text value. The function doesn't check that an element with the id passed to it exists, so additional checks would be required for a real-world application.
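One way the missing-id check might be added is sketched below. The name `innerTextById` and the optional `doc` parameter are assumptions made for illustration (the second parameter only exists so the function can be exercised without a real page), and returning an empty string for an unknown id is a design choice rather than a requirement.

```javascript
// Sketch: same idea as innerText(id), but tolerant of an unknown id.
// `innerTextById` and the `doc` parameter are hypothetical.
function innerTextById(id, doc) {
  // Fall back to the page's own document when no document is supplied.
  doc = doc || document;
  var element = doc.getElementById(id);
  // getElementById returns null when nothing matches the id.
  if (!element) {
    return "";
  }
  var i, child, text = [];
  for (i = 0; i < element.childNodes.length; i++) {
    child = element.childNodes[i];
    // Keep text nodes (nodeType 3) that contain something besides whitespace.
    if (child.nodeType === 3 && /\S/.test(child.nodeValue)) {
      text.push(child.nodeValue);
    }
  }
  return text.join("");
}
```

Called with a single argument, as InvokeScript would do with a one-element argument array, it uses the page's document. The caller on the C# side would then need to treat an empty string as "not found or no direct text".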

detaylor