
I am taking a stab at Html Agility Pack and am having trouble finding the right way to go about this.

For example:

var findclasses = _doc.DocumentNode.Descendants("div").Where(d => d.Attributes.Contains("class"));

However, you can obviously add classes to a lot more than divs, so I tried this:

var allLinksWithDivAndClass = _doc.DocumentNode.SelectNodes("//*[@class=\"float\"]");

But that doesn't handle the cases where you add multiple classes and "float" is just one of them like this..

class="className float anotherclassName"

Is there a way to handle all of this? I basically want to select every node that has a class attribute whose list of classes includes float.

Answer has been documented on my blog with a full explanation at: Html Agility Pack Get All Elements by Class

Adam

6 Answers


(Updated 2018-03-17)

The problem:

The problem, as you've spotted, is that String.Contains does not perform a word-boundary check, so Contains("float") will return true for both "foo float bar" (correct) and "unfloating" (which is incorrect).

The solution is to ensure that "float" (or whatever your desired class-name is) appears alongside a word-boundary at both ends. A word-boundary is either the start (or end) of a string (or line), whitespace, certain punctuation, etc. In most regular-expressions this is \b. So the regex you want is simply: \bfloat\b.
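One caveat is worth illustrating: `\b` treats a hyphen as a boundary, so `\bfloat\b` also matches the class `float-left`. The sketch below compares it with a stricter variant that only accepts whitespace or the string edge on either side. The class and method names are hypothetical, chosen for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class ClassTokenRegexDemo
{
    // \b treats any non-word character (including '-') as a boundary.
    public static bool MatchesWordBoundary(string classAttr, string className) =>
        Regex.IsMatch(classAttr, @"\b" + Regex.Escape(className) + @"\b");

    // The lookarounds require whitespace or the string edge on both sides,
    // which is how class names are actually separated in a class attribute.
    public static bool MatchesWhitespaceToken(string classAttr, string className) =>
        Regex.IsMatch(classAttr, @"(?<!\S)" + Regex.Escape(className) + @"(?!\S)");
}
```

Both methods accept "foo float bar" and reject "unfloating"; they differ on "float-left", which only the `\b` version accepts.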

A downside to using a Regex instance is that they can be slow to run if you don't use the .Compiled option - and they can be slow to compile. So you should cache the regex instance. This is more difficult if the class-name you're looking for changes at runtime.

Alternatively, you can search for whole words without a regex by implementing the word-boundary check as a C# string-processing function, taking care not to allocate new strings or other objects (for example, by avoiding String.Split).

Approach 1: Using a regular-expression:

Suppose you just want to look for elements with a single, design-time specified class-name:

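using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class Program {

    // Compiled once and cached, because constructing a compiled regex is expensive.
    private static readonly Regex _classNameRegex = new Regex( @"\bfloat\b", RegexOptions.Compiled );

    private static IEnumerable<HtmlNode> GetFloatElements(HtmlDocument doc) {
        return doc
            .DocumentNode // Descendants() is defined on HtmlNode, not on HtmlDocument
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e => e.Name == "div" && _classNameRegex.IsMatch( e.GetAttributeValue("class", "") ) );
    }
}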

If you need to choose a single class-name at runtime then you can build a regex:

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {

    // Built once per call; cache it if you call this repeatedly with the same class-name.
    Regex regex = new Regex( "\\b" + Regex.Escape( className ) + "\\b", RegexOptions.Compiled );

    return doc
        .DocumentNode // Descendants() is defined on HtmlNode, not on HtmlDocument
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e => e.Name == "div" && regex.IsMatch( e.GetAttributeValue("class", "") ) );
}

If you have multiple class-names and you want to match all of them, you could create an array of Regex objects and ensure they're all matching, or combine them into a single Regex using lookarounds, but this results in horrendously complicated expressions - so using a Regex[] is probably better:

using System.Linq;

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String[] classNames) {

    // One regex per required class-name.
    Regex[] exprs = new Regex[ classNames.Length ];
    for( Int32 i = 0; i < exprs.Length; i++ ) {
        exprs[i] = new Regex( "\\b" + Regex.Escape( classNames[i] ) + "\\b", RegexOptions.Compiled );
    }

    return doc
        .DocumentNode // Descendants() is defined on HtmlNode, not on HtmlDocument
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            exprs.All( r =>
                r.IsMatch( e.GetAttributeValue("class", "") )
            )
        );
}
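For comparison, here is a minimal sketch of the single-regex alternative mentioned above. It chains one lookahead per class-name, so the pattern only matches when every name appears somewhere in the attribute. The `AllClassesRegex` class and its `Build` method are hypothetical names, not part of any library.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class AllClassesRegex
{
    // Builds a pattern such as ^(?=.*\bfoo\b)(?=.*\bbar\b):
    // each lookahead asserts that one class-name occurs somewhere in the input.
    public static Regex Build(params string[] classNames)
    {
        string pattern = "^" + string.Concat(
            classNames.Select(c => @"(?=.*\b" + Regex.Escape(c) + @"\b)"));

        // Singleline lets '.' cross line breaks inside a multi-line class attribute.
        return new Regex(pattern, RegexOptions.Compiled | RegexOptions.Singleline);
    }
}
```

The result can be used in place of the `exprs.All(...)` check, for example `AllClassesRegex.Build("float", "left").IsMatch(classValue)`. Whether it is faster than a `Regex[]` is something to measure rather than assume.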

Approach 2: Using non-regex string matching:

The advantage of using a custom C# method to do string matching instead of a regex is hypothetically faster performance and reduced memory usage (though Regex may be faster in some circumstances - always profile your code first, kids!)

The method below, CheapClassListContains, provides a fast word-boundary-checking string matching function that can be used the same way as regex.IsMatch:

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {

    return doc
        .DocumentNode // Descendants() is defined on HtmlNode, not on HtmlDocument
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            CheapClassListContains(
                e.GetAttributeValue("class", ""),
                className,
                StringComparison.Ordinal
            )
        );
}

/// <summary>Performs optionally-whitespace-padded string search without new string allocations.</summary>
/// <remarks>A regex might also work, but constructing a new regex every time this method is called would be expensive.</remarks>
private static Boolean CheapClassListContains(String haystack, String needle, StringComparison comparison)
{
    if( String.Equals( haystack, needle, comparison ) ) return true;
    Int32 idx = 0;
    while( idx + needle.Length <= haystack.Length )
    {
        idx = haystack.IndexOf( needle, idx, comparison );
        if( idx == -1 ) return false;

        Int32 end = idx + needle.Length;

        // Needle must be enclosed in whitespace or be at the start/end of string
        Boolean validStart = idx == 0               || Char.IsWhiteSpace( haystack[idx - 1] );
        Boolean validEnd   = end == haystack.Length || Char.IsWhiteSpace( haystack[end] );
        if( validStart && validEnd ) return true;

        idx++;
    }
    return false;
}

Approach 3: Using a CSS Selector library:

HtmlAgilityPack is somewhat stagnant and doesn't support .querySelector and .querySelectorAll, but there are third-party libraries that extend HtmlAgilityPack with them, namely Fizzler and CssSelectors. Both Fizzler and CssSelectors implement QuerySelectorAll, so you can use it like so:

private static IEnumerable<HtmlNode> GetDivElementsWithFloatClass(HtmlDocument doc) {

    return doc.QuerySelectorAll( "div.float" );
}

With runtime-defined classes:

private static IEnumerable<HtmlNode> GetDivElementsWithClasses(HtmlDocument doc, IEnumerable<String> classNames) {

    String selector = "div." + String.Join( ".", classNames );

    return doc.QuerySelectorAll( selector  );
}
Dai
  • Won't this cause only divs to be found? What if I add that class to a – Adam Dec 10 '12 at 14:35
  • Then remove the "div" predicate. – Dai Dec 10 '12 at 20:49
  • Can you just do .Descendants("")? – Adam Dec 10 '12 at 21:05
  • `Contains()` doesn't exist on the attribute, so replace `d.Attributes["class"].Contains("float")` with `d.Attributes["class"].Value.Split(' ').Any(b => b.Equals("float"))` – maxp Jan 08 '14 at 10:13
  • Oh, `Contains` works for me because I wrote my own extension method. – Dai Mar 04 '14 at 19:08
  • If there were a class named `floating` then `Value.Contains("float")` would also match that – tic Sep 17 '15 at 14:49
  • Web scraping spree :-) – Carlo Luther Jul 29 '16 at 21:53
  • I spent hours trying to get `HtmlAgilityPack` to work and this was the only example that I actually got to work for attributes. I'm not sure if things changed with different versions, but my install does not have `SelectNodes` and `HtmlNavigator`. – doubleJ Nov 19 '16 at 15:16
  • @Dai re: the RegEx comment in CheapClassListContains(). Why not just create the RegEx as a class level static element? Thank you for your answer. – Robert Oschler Mar 18 '18 at 02:35
  • @RobertOschler `CheapClassListContains` is potentially cheaper than a regex and implements the same logic - but yes, that is also an option. – Dai Mar 18 '18 at 02:36

You can solve your issue by using the 'contains' function within your XPath query, as below:

var allElementsWithClassFloat = 
   _doc.DocumentNode.SelectNodes("//*[contains(@class,'float')]");

To reuse this in a function do something similar to the following:

string classToFind = "float";    
var allElementsWithClassFloat = 
   _doc.DocumentNode.SelectNodes(string.Format("//*[contains(@class,'{0}')]", classToFind));
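Note that `contains(@class,'float')` is a substring test, so it also matches classes such as `floating`. A common XPath 1.0 idiom pads the normalized class list with spaces and searches for the space-padded name instead. The sketch below demonstrates it with `System.Xml.XmlDocument` so that it runs without extra packages; the assumption is that the same query string can be passed to HtmlAgilityPack's `SelectNodes`. The `XPathClassToken` class is a hypothetical helper, and it assumes the class-name contains no quote characters.

```csharp
using System.Xml;

public static class XPathClassToken
{
    // Pads the normalized class list with spaces so the name only matches as a whole token.
    // Assumes className contains no apostrophes, which would break the query string.
    public static string BuildQuery(string className) =>
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";

    // Counts the elements in an XML fragment whose class list includes className.
    public static int CountMatches(string xml, string className)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(BuildQuery(className)).Count;
    }
}
```

With this query, an element with class="x float y" or class="float" matches, while class="floating" does not.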
Anthony Horne
Ryan McCarty

I used this extension method a lot in my project. Hope it will help one of you guys.

public static bool HasClass(this HtmlNode node, params string[] classValueArray)
    {
        var classValue = node.GetAttributeValue("class", "");
        var classValues = classValue.Split(' ');
        return classValueArray.All(c => classValues.Contains(c));
    }
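Since `Split(' ')` only splits on the space character, class lists separated by tabs or line breaks would not be tokenized correctly. A possible variant, written here against a plain string so it stands alone, splits on any whitespace. `ClassListHelper.ContainsAll` is a hypothetical name for illustration.

```csharp
using System;
using System.Linq;

public static class ClassListHelper
{
    // Returns true when every required class-name appears as a whole token.
    public static bool ContainsAll(string classAttr, params string[] required)
    {
        // A null separator array makes Split break on all whitespace characters.
        string[] tokens = (classAttr ?? "")
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        return required.All(r => tokens.Contains(r, StringComparer.Ordinal));
    }
}
```

Inside the extension method above, this could be called as `ClassListHelper.ContainsAll(node.GetAttributeValue("class", ""), classValueArray)`.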
Hung Cao
  • Don't use `ToLower()` when what you really want is to IgnoreCase comparison. Passing `StringComparison.CultureIgnoreCase` is cleaner and shows a more explicit intent. – Pauli Østerø Jan 05 '17 at 20:50
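public static List<HtmlNode> GetTagsWithClass(string html, List<string> @class)
    {
        // Parse the markup first; the original snippet used an undeclared htmlDocument.
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        // Note: this matches only when the whole class attribute equals one of the given names.
        var result = htmlDocument.DocumentNode.Descendants()
            .Where(x => x.Attributes.Contains("class") && @class.Contains(x.Attributes["class"].Value))
            .ToList();
        return result;
    }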
hadi.sh

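If you are looking for a class on some tag (like span or any other), try this one:

 var spans = doc.DocumentNode.SelectNodes("//span"); // or another tag, or all nodes; returns null when nothing matches

 // GetAttributeValue avoids a NullReferenceException on spans that have no class attribute
 var span_with_class = spans.Where(_ => _.GetAttributeValue("class", "").Split(' ').Any(b => b.Equals("someClass")));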
  • Code-only answers are discouraged. Answers have more long-term value if they come with explanations about how/why the code solves the problem. – tdy Oct 20 '21 at 18:13

You can use the following script:

var findclasses = _doc.DocumentNode.Descendants("div").Where(d => 
    d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("float")
);
Igor Kustov