2

In C#, how do I get the text of a System.Windows.Forms.HtmlElement not including the text from its children?

If I have

<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div>

then the InnerText property of the whole thing is "aaabbbcccddd" and I just want "aaa".

I figure this should be trivial, but I haven't found anything to produce the "immediate" text of an HtmlElement in C#. More ludicrous ideas are "subtracting" the InnerText of the children from the parent, but that's an insane amount of work for something that I'm sure is trivial.

(All I want is access to the Text Node of the HtmlElement.)

I'd certainly appreciate any help (or pointers) that anyone can supply.

Many thanks.

Examples:

<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div>  -> Produce "aaa"
<div><div>ccc</div><div>ddd</div></div>                   -> Produce ""
<div>ccc</div>                                            -> Produce "ccc" 

Edit

There are a number of ways to skin this particular cat, none of them elegant. However, given my constraints (not my HTML, quite possibly not valid), I think Aleksey Bykov's solution is closest to what I needed (and indeed, I did implement the same solution he suggested in the last comment.)

I've selected his solution and upvoted all the other ones that I think would work, but weren't optimal for me. I'll check back to upvote any other solutions that seem likely to work.

Many thanks.

Tom West
  • 1
    Two ways you could do this. 1) Subtract the innertext of each of the child elements from the inner text of the parent (edit: just noticed you don't want to do this, sorry) 2) Create a duplicate of the element and remove all children, then get innertext. – Asad Saeeduddin Nov 11 '13 at 04:21
  • Indeed, I'm considering both, but I have to think there's got to be an easier way. I may look at HtmlAgilityPack if .NET access to the DOM is really that brain-damaged. – Tom West Nov 11 '13 at 04:38
  • 1
    Kind of depends on which `HtmlElement` you're talking about. `System.Web.UI.HtmlElement`? `System.Windows.Forms.HtmlElement`? `System.Windows.Browser.HtmlElement`? You should be able to find all of the immediate child nodes of the HtmlElement and only pick the ones which are text nodes. – Heretic Monkey Nov 11 '13 at 04:47
  • Thanks, I forgot there was more than one. I've modified the question to refer to System.Windows.Forms.HtmlElement. But how do I tell which child nodes are the text nodes? (Or just get the child nodes?) – Tom West Nov 11 '13 at 04:58

3 Answers

1

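Maybe it's simpler than that, if you're willing to use XmlDocument instead of HtmlDocument - you can just use the 'Value' property of the element's first child node. When the first child is a text node, 'Value' is its text; when it is an element, 'Value' is null.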

This code gives the output you want for the 3 cases you mentioned:

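using System;
using System.Xml;

class Program
{
    private static string[] htmlTests = {@"<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div>",
                                         @"<div><div>ccc</div><div>ddd</div></div>",
                                         @"<div>ccc</div>" };
    static void Main(string[] args)
    {
        var page = new XmlDocument();

        foreach (var test in htmlTests)
        {
            // LoadXml throws XmlException if the markup is not well-formed XML.
            page.LoadXml(test);
            // Only the first child is inspected; Value is null (printed as an
            // empty line) when that child is an element rather than a text node.
            Console.WriteLine(page.DocumentElement.FirstChild.Value);
        }
    }
}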

Output:

aaa

ccc
Baldrick
  • Indeed, and the document can be constructed from `originalHTMLElement.OuterHtml`. Whether this can be used depends on whether the OP is dealing with well formed HTML though. – Asad Saeeduddin Nov 11 '13 at 05:03
  • it may or may not work in a general case since HTML != XML, there are situations when valid HTML cannot be parsed to XML (for example problems with namespaces) – Trident D'Gao Nov 11 '13 at 05:08
  • I guess the usefulness of this solution really depends on the OP's real-world requirements - the examples posted were certainly valid XML. – Baldrick Nov 11 '13 at 05:13
  • It's true this solves the problem I posted, but my real-world problem involves customers' HTML (i.e. probably not valid :-( ). Many thanks. – Tom West Nov 11 '13 at 05:23
0

I am not sure what you mean by HtmlElement, but with XmlElement you would do it like this:

using System;
using System.Xml;
using System.Linq;
using System.Collections.Generic;
using System.Text;

public static class XmlUtils {

    public static IEnumerable<String> GetImmediateTextValues(XmlNode node) {
        var values = node.ChildNodes.Cast<XmlNode>().Aggregate(
            new List<String>(),
            (xs, x) => { if (x.NodeType == XmlNodeType.Text) { xs.Add(x.Value); } return xs; }
        );
        return values;
    }

    public static String GetImmediateJoinedTextValues(XmlNode node, String delimiter) {
        var values = GetImmediateTextValues(node);
        var text = String.Join(delimiter, values.ToArray());
        return text;
    }
}
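
As a usage sketch of the same idea (pick out only the immediate text-node children), here is a small self-contained program run against the three examples from the question. The class name `ImmediateTextDemo` and the method `ImmediateText` are made up for illustration, and it assumes the markup is well-formed XML, which real-world HTML often is not.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class ImmediateTextDemo
{
    // Hypothetical helper: concatenates the text of the direct text-node
    // children of a node, ignoring anything nested inside child elements.
    public static string ImmediateText(XmlNode node)
    {
        return string.Concat(node.ChildNodes.OfType<XmlText>().Select(t => t.Value));
    }

    public static void Main()
    {
        string[] tests =
        {
            "<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div>",
            "<div><div>ccc</div><div>ddd</div></div>",
            "<div>ccc</div>"
        };

        var doc = new XmlDocument();
        foreach (var test in tests)
        {
            doc.LoadXml(test); // throws XmlException on malformed markup
            Console.WriteLine("\"" + ImmediateText(doc.DocumentElement) + "\"");
        }
        // Prints "aaa", "" and "ccc", one per line.
    }
}
```

Unlike reading only `FirstChild.Value`, this also picks up text that follows a child element, e.g. the `x` in `<div><b>y</b>x</div>`.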

EDIT:

Well, if your HtmlElement comes from System.Windows.Forms, then what you need to do is to use its DomElement property trying to cast it to one of the COM interfaces defined in mshtml. So all you need to do is to be able to tell if the element you are looking at is a text node and get its value. First you gotta add a reference to the mshtml COM library. You can do something like this (I cannot verify this code immediately).

public bool IsTextNode(HtmlElement element) {
  var result = false;
  // DomElement exposes the underlying COM object; try to view it as an mshtml DOM node.
  var nativeNode = element.DomElement as mshtml.IHTMLDOMNode;
  if (nativeNode != null) {
      var nodeType = nativeNode.nodeType;
      result = nodeType == 3; // -- TextNode: http://msdn.microsoft.com/en-us/library/aa704085(v=vs.85).aspx
  }
  return result;
}
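
A rough sketch of how the immediate text could be collected from a `System.Windows.Forms.HtmlElement` by walking its child nodes through mshtml. Like the snippet above, this has not been run against a live WebBrowser control. It assumes a reference to the mshtml COM library (Microsoft HTML Object Library), and the names `HtmlElementText` and `GetImmediateText` are invented for illustration.

```csharp
using System.Text;
using System.Windows.Forms;
using mshtml; // assumption: COM reference to the Microsoft HTML Object Library

public static class HtmlElementText
{
    private const int TextNodeType = 3; // DOM nodeType value for text nodes

    // Hypothetical helper: returns the text of the element's direct
    // text-node children only, skipping text inside child elements.
    public static string GetImmediateText(HtmlElement element)
    {
        var parent = element.DomElement as IHTMLDOMNode;
        if (parent == null)
        {
            return string.Empty;
        }

        var text = new StringBuilder();
        // Walk the immediate children: firstChild, then nextSibling until null.
        for (IHTMLDOMNode child = parent.firstChild; child != null; child = child.nextSibling)
        {
            if (child.nodeType != TextNodeType)
            {
                continue;
            }

            var textNode = child as IHTMLDOMTextNode;
            if (textNode != null)
            {
                text.Append(textNode.data);
            }
        }
        return text.ToString();
    }
}
```

For the first example in the question this should yield "aaa", since only the outer div's own text node is visited. Because the browser has already parsed the page, it does not require the HTML to be well-formed XML.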

Trident D'Gao
  • 1
    http://msdn.microsoft.com/en-us/library/System.Windows.Forms.HtmlElement(v=vs.110).aspx. Your approach unfortunately doesn't work, since HTMLElement lacks a way to refer to text nodes. – Asad Saeeduddin Nov 11 '13 at 04:37
  • since you are going to deal with mshtml anyway I'd recommend you to use interfaces from there instead of HtmlElement from System.Windows.Forms because it's so half-baked – Trident D'Gao Nov 11 '13 at 05:05
  • Unfortunately, it appears that all System.Windows.Forms.HtmlElement objects are Element nodes. Perhaps not all that surprising. On the other hand, I suspect I can get the text node using mshtml. (Or give up on the HtmlElement, anyway) – Tom West Nov 11 '13 at 05:07
  • 1
    @TomWest, well it seems they only exposed html elements in Windows.Forms omitting text nodes, don't panic, all is not lost, what you still can do is to cast the `DomElement` of your parent HTML element (which is a HtmlElement by any means) to `IHTMLDOMNode` and then get its first child using `firstChild` then on the first child use `nextSibling` to get from the first to the next etc., checking `nodeType` along the way. If `nodeType == 3` then you cast it to `IHTMLDOMTextNode` and use the `data` property to extract text. – Trident D'Gao Nov 11 '13 at 05:19
  • Exactly the way I just did it. Thanks for giving me the clue (mshtml) and the code to solve my problem. Amazing that they don't provide that functionality in .NET – Tom West Nov 11 '13 at 05:21
-1

Well, you could do something like this (assuming your input is in a string called `input`):

string pattern = @">.*?<";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase); 

MatchCollection matches = rgx.Matches(input);
var first_match = matches[0].ToString();
string result = first_match.Substring(1, first_match.Length - 2);

I probably wouldn't do it (or just rely on matching the string for the first <div> and </div>) ... here, for extra credit:

// Take whatever sits between the first '>' and the next '<' in the input.
int start = input.IndexOf(">") + 1;
int end = input.IndexOf("<", start);
string result = input.Substring(start, end - start);
Noctis
  • While that would work in many cases, using Regex to parse HTML is generally considered a very large no-no. – Tom West Nov 11 '13 at 05:30
  • 1
    I totally agree. More than that, I love [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) as to why not do it :) . Just gave another option so he can choose from. – Noctis Nov 11 '13 at 05:36
  • Thank you, Noctis. I just had the best laugh all day. – Tom West Nov 11 '13 at 05:49
  • My pleasure. Smart cookie he is, isn't he? – Noctis Nov 11 '13 at 05:52