2

I'm currently using HtmlAgilityPack to search for certain content via an xpath query. Something like this:

var col = doc.DocumentNode.SelectNodes("//*[text()[contains(., 'foo')] or @*....

Now I want to search for specific content in all of the HTML source code (= text, tags and attributes) using a regular expression. How can this be achieved with HtmlAgilityPack? Can HtmlAgilityPack handle XPath + regex, or what would be the best way of combining a regex with HtmlAgilityPack to search?

juFo
  • Possible duplicate: http://stackoverflow.com/a/11729611/2186023. Since that entry is over a year old, I looked into the [history](http://htmlagilitypack.codeplex.com/SourceControl/list/changesets); no such functionality seems to have been added since then, so you will probably have to use basic C# regex functionality in conjunction with HtmlAgilityPack (maybe you don't need HtmlAgilityPack at all any more, since you say you're searching `all of the html`) – DrCopyPaste Nov 20 '14 at 11:33
  • I'm already using HtmlAgilityPack for other purposes, so it would be nice to do everything with HtmlAgilityPack. – juFo Nov 20 '14 at 13:54
  • Well you still can, but I think in that case it would be only useful to narrow down the source code that actually needs to be matched against regex. – DrCopyPaste Nov 20 '14 at 14:15
  • Good question. I never came across the other mentioned post :-) – Simon Mourier Nov 21 '14 at 11:27
  • What's the regular expression? Consider the classic http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jodrell Nov 21 '14 at 11:31

2 Answers

7

The Html Agility Pack uses the underlying .NET XPATH implementation for its XPATH support. Fortunately XPATH in .NET is fully extensible (BTW: it's a shame Microsoft doesn't invest any more in this superb technology...).

So, let's suppose I have this html:

<div>hello</div>
<div>hallo</div>

Here is sample code that selects both nodes, because it matches each node's text against the regular expression 'h.llo':

HtmlNodeNavigator nav = new HtmlNodeNavigator("mypage.htm");
foreach (var node in SelectNodes(nav, "//div[regex-is-match(text(), 'h.llo')]"))
{
    Console.WriteLine(node.OuterHtml); // should dump both div elements
}

It works because I use a special Xslt/XPath context where I have defined a new XPATH function called "regex-is-match". Here is the SelectNodes utility code:

public static IEnumerable<HtmlNode> SelectNodes(HtmlNodeNavigator navigator, string xpath)
{
    if (navigator == null)
        throw new ArgumentNullException("navigator");

    XPathExpression expr = navigator.Compile(xpath);
    expr.SetContext(new HtmlXsltContext());

    object eval = navigator.Evaluate(expr);
    XPathNodeIterator it = eval as XPathNodeIterator;
    if (it != null)
    {
        while (it.MoveNext())
        {
            HtmlNodeNavigator n = it.Current as HtmlNodeNavigator;
            if (n != null && n.CurrentNode != null)
            {
                yield return n.CurrentNode;
            }
        }
    }
}

And here is the support code:

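    // Namespaces needed by the code in this answer.
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Xml;
    using System.Xml.XPath;
    using System.Xml.Xsl;
    using HtmlAgilityPack;

    public class HtmlXsltContext : XsltContext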
    {
        public HtmlXsltContext()
            : base(new NameTable())
        {
        }

        public override int CompareDocument(string baseUri, string nextbaseUri)
        {
            throw new NotImplementedException();
        }

        public override bool PreserveWhitespace(XPathNavigator node)
        {
            throw new NotImplementedException();
        }

        protected virtual IXsltContextFunction CreateHtmlXsltFunction(string prefix, string name, XPathResultType[] ArgTypes)
        {
            return HtmlXsltFunction.GetBuiltIn(this, prefix, name, ArgTypes);
        }

        public override IXsltContextFunction ResolveFunction(string prefix, string name, XPathResultType[] ArgTypes)
        {
            return CreateHtmlXsltFunction(prefix, name, ArgTypes);
        }

        public override IXsltContextVariable ResolveVariable(string prefix, string name)
        {
            throw new NotImplementedException();
        }

        public override bool Whitespace
        {
            get { return true; }
        }
    }

    public abstract class HtmlXsltFunction : IXsltContextFunction
    {
        protected HtmlXsltFunction(HtmlXsltContext context, string prefix, string name, XPathResultType[] argTypes)
        {
            Context = context;
            Prefix = prefix;
            Name = name;
            ArgTypes = argTypes;
        }

        public HtmlXsltContext Context { get; private set; }
        public string Prefix { get; private set; }
        public string Name { get; private set; }
        public XPathResultType[] ArgTypes { get; private set; }

        public virtual int Maxargs
        {
            get { return Minargs; }
        }

        public virtual int Minargs
        {
            get { return 1; }
        }

        public virtual XPathResultType ReturnType
        {
            get { return XPathResultType.String; }
        }

        public abstract object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext);

        public static IXsltContextFunction GetBuiltIn(HtmlXsltContext context, string prefix, string name, XPathResultType[] argTypes)
        {
            if (name == "regex-is-match")
                return new RegexIsMatch(context, name);

            // TODO: create other functions here
            return null;
        }

        public static string ConvertToString(object argument, bool outer, string separator)
        {
            if (argument == null)
                return null;

            string s = argument as string;
            if (s != null)
                return s;

            XPathNodeIterator it = argument as XPathNodeIterator;
            if (it != null)
            {
                if (!it.MoveNext())
                    return null;

                StringBuilder sb = new StringBuilder();
                do
                {
                    HtmlNodeNavigator n = it.Current as HtmlNodeNavigator;
                    if (n != null && n.CurrentNode != null)
                    {
                        if (sb.Length > 0 && separator != null)
                        {
                            sb.Append(separator);
                        }

                        sb.Append(outer ? n.CurrentNode.OuterHtml : n.CurrentNode.InnerHtml);
                    }
                }
                while (it.MoveNext());
                return sb.ToString();
            }

            IEnumerable enumerable = argument as IEnumerable;
            if (enumerable != null)
            {
                StringBuilder sb = null;
                foreach (object arg in enumerable)
                {
                    if (sb == null)
                    {
                        sb = new StringBuilder();
                    }

                    if (sb.Length > 0 && separator != null)
                    {
                        sb.Append(separator);
                    }

                    string s2 = ConvertToString(arg, outer, separator);
                    if (s2 != null)
                    {
                        sb.Append(s2);
                    }
                }
                return sb != null ? sb.ToString() : null;
            }

            return string.Format("{0}", argument);
        }

        public class RegexIsMatch : HtmlXsltFunction
        {
            public RegexIsMatch(HtmlXsltContext context, string name)
                : base(context, null, name, null)
            {
            }

            public override XPathResultType ReturnType { get { return XPathResultType.Boolean; } }
            public override int Minargs { get { return 2; } }

            public override object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext)
            {
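                if (args.Length < 2)
                    return false;

                // ConvertToString returns null for an empty node set, and Regex.IsMatch throws on null.
                string input = ConvertToString(args[0], false, null);
                string pattern = ConvertToString(args[1], false, null);
                if (input == null || pattern == null)
                    return false;

                return Regex.IsMatch(input, pattern);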
            }
        }
    }

The regex function is implemented in the RegexIsMatch class at the end. It's not super complicated. Note the very useful ConvertToString utility function, which tries to coerce any XPath "thing" (a string, a node iterator, or an enumerable) into a string.

Of course, with this technology, you can define whatever XPATH function you need with very little code (I use this all the time to do upper/lower case conversions...).
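
As a rough sketch of what such an additional function might look like, here is a lower-casing function modeled on RegexIsMatch. The class name LowerCase and the XPath name "lower-case" are assumptions made for illustration only; the code relies on the HtmlXsltFunction and HtmlXsltContext classes from the support code above and is not part of Html Agility Pack itself.

```csharp
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical example: an XPath function that lower-cases its single argument.
// Depends on HtmlXsltFunction / HtmlXsltContext from the support code above.
public class LowerCase : HtmlXsltFunction
{
    public LowerCase(HtmlXsltContext context, string name)
        : base(context, null, name, null)
    {
    }

    // Minargs, Maxargs and ReturnType keep the base defaults:
    // exactly one argument, string result.
    public override object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext)
    {
        if (args.Length < 1)
            return string.Empty;

        // ConvertToString returns null for an empty node set.
        string s = ConvertToString(args[0], false, null);
        return s == null ? string.Empty : s.ToLowerInvariant();
    }
}

// To make the function resolvable, GetBuiltIn would need one more branch
// before its final "return null;" (assumed name "lower-case"):
//
//     if (name == "lower-case")
//         return new LowerCase(context, name);
//
// A query could then look like:
//
//     SelectNodes(nav, "//div[lower-case(text()) = 'hello']")
```

Returning an empty string instead of null for a missing argument is a design choice in this sketch, so that comparisons in the XPath expression simply fail to match rather than throw.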

Simon Mourier
  • Excellent solution! Great implementation. I'd suggest changing the `SelectNodes` function a bit (making it an extension method for `HtmlNode`) and returning `HtmlNodeCollection` instead, like the built-in HtmlAgilityPack's `SelectNodes()`. – Elad Nava Nov 01 '15 at 08:54
  • This is arguable. I prefer using yield which allocates and browses collections only when needed. Then use Linq's library (ToArray, ToList, etc.) only when needed. Html Agility Pack was written well before yield even existed. If I was creating it today, I would certainly return an enumerable instead of a collection. – Simon Mourier Nov 01 '15 at 11:22
  • Makes sense. However, my project already implemented `HtmlNodeCollection` all over the place, so this seemed like a logical solution for me. – Elad Nava Nov 01 '15 at 22:06
  • This looks like the perfect solution.. but I'm having trouble implementing it. I get the error "The name 'HtmlXsltFunction' does not exist in the current context'. Any ideas? – Moss Palmer May 24 '16 at 14:13
0

Directly quoting,

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

It might make sense to use a regular expression on some parts of an HTML document. Trying to use HtmlAgilityPack to run a regular expression on the tags and structure of an HTML document is perverse and, ultimately, cannot provide a universal solution to your problem.
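
Applying a regular expression to only some parts of the document could be sketched roughly as follows: Html Agility Pack does the parsing, and the regex is run only against text nodes and attribute values. The RegexSearch class and FindNodes method are hypothetical names introduced for this illustration, and the sketch deliberately does not try to match tag structure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class RegexSearch
{
    // Hypothetical helper: yields text nodes whose text matches the pattern,
    // and elements that have at least one attribute value matching it.
    public static IEnumerable<HtmlNode> FindNodes(HtmlDocument doc, string pattern)
    {
        var regex = new Regex(pattern);

        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Text node: test the text itself.
                if (regex.IsMatch(node.InnerText))
                    yield return node;
            }
            else if (node.NodeType == HtmlNodeType.Element)
            {
                // Element: test each attribute value (guarding against null).
                if (node.Attributes.Any(a => regex.IsMatch(a.Value ?? string.Empty)))
                    yield return node;
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div title='hallo'>hello</div><div>other</div>");

        foreach (HtmlNode node in RegexSearch.FindNodes(doc, "h.llo"))
            Console.WriteLine(node.OuterHtml);
    }
}
```

Because the regex only ever sees already-parsed text and attribute values, it never has to cope with nesting or malformed markup, which is the part regular expressions are ill-suited for.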

Jodrell
  • Good and very interesting answer, but I didn't want a reason why or why not; that is why I gave the bounty to Simon Mourier. His solution also makes it easy to implement other functions, which makes my project a lot "faster". – juFo Nov 21 '14 at 14:39