python lxml - simply get/check class of HTML element

Question

I use tree.xpath to iterate over all interesting HTML elements but I need to be able to tell whether the current element is part of a certain CSS class or not.

from lxml import html

mypage = """
<div class="otherclass exampleclass">some</div>
<div class="otherclass">things</div>
<div class="exampleclass">are</div>
<div class="otherclass">better</div>
<div>left</div>"""

tree = html.fromstring(mypage)

for item in tree.xpath( "//div" ):
  print("testing")
  #if "exampleclass" in item.getListOfClasses():
  #  print("foo")
  #else:
  #  print("bar")

The overall structure should remain the same.

What is a fast way to check whether the current div has the exampleclass class?

In the above example, item is an lxml.html.HtmlElement, which has a classes property, but I don't understand what this means:

classes
A set-like wrapper around the 'class' attribute.

Get Method:
unreachable.classes(self) - A set-like wrapper around the 'class' attribute.

Set Method:
unreachable.classes(self, classes)

It returns an lxml.html.Classes object, which has an __iter__ method, and it turns out iter() works. So I constructed this code:

for item in tree.xpath( "//div" ):
  match = False
  for classname in iter(item.classes):
    if classname == "exampleclass":
      match = True
  if match:
    print("foo")
  else:
    print("bar")

But I'm hoping there is a more elegant method.

I tried searching for similar questions, but all I found were variations of "how do I get all elements of 'classname'". I need all divs in the loop; I just want to treat some of them differently.

Padraic Cunningham · Accepted Answer · 2016-09-19T22:40:52.750

There is no need for iter; the test if "exampleclass" in item.classes: does exactly the same thing, only more efficiently.

from lxml import html

mypage = """
<div class="otherclass exampleclass">some</div>
<div class="otherclass">things</div>
<div class="exampleclass">are</div>
<div class="otherclass">better</div>
<div>left</div>"""

tree = html.fromstring(mypage)

for item in tree.xpath("//div"):
    if "exampleclass" in item.classes:
        print("foo")
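
A minimal sketch of the same loop with the else branch the question asks for, assuming an lxml version that provides the classes property. The results dict is only there for illustration and is not part of the original code:

```python
from lxml import html

# Same sample markup as in the question.
mypage = """
<div class="otherclass exampleclass">some</div>
<div class="otherclass">things</div>
<div class="exampleclass">are</div>
<div class="otherclass">better</div>
<div>left</div>"""

tree = html.fromstring(mypage)

# Maps each div's text to the branch it took (illustrative helper only).
results = {}
for item in tree.xpath("//div"):
    # item.classes is set-like, so "in" tests for a whole class name.
    if "exampleclass" in item.classes:
        results[item.text] = "foo"
    else:
        results[item.text] = "bar"

print(results)
```

Every div stays in the loop, and only the ones carrying exampleclass take the first branch.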

The difference between the two forms is that calling iter on a set makes the lookup linear, so it is definitely not an efficient way to search a set. There is not much difference here, but in some cases the difference would be monumental:

In [1]: st = set(range(1000000))

In [2]: timeit 100000 in st
10000000 loops, best of 3: 51.4 ns per loop

In [3]: timeit 100000 in iter(st)
100 loops, best of 3: 1.82 ms per loop
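
Outside IPython, roughly the same comparison can be sketched with the standard timeit module. The repeat count of 100 is an arbitrary choice, and the absolute numbers will vary by machine:

```python
import timeit

st = set(range(1000000))

# Direct membership test: uses the set's hash lookup.
hashed = timeit.timeit(lambda: 100000 in st, number=100)

# Membership test on an iterator: walks the elements one by one.
linear = timeit.timeit(lambda: 100000 in iter(st), number=100)

print(f"in st:       {hashed:.6f} s for 100 lookups")
print(f"in iter(st): {linear:.6f} s for 100 lookups")
```

Both expressions evaluate to True; only the time taken differs.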

You can also use CSS selectors with lxml (this relies on the separate cssselect package being installed):

for item in tree.cssselect("div.exampleclass"):
    print("foo")

Depending on the case, you may also be able to use contains:

for item in tree.xpath("//div[contains(@class, 'exampleclass')]"):
    print("foo")
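
If a plain contains is too loose, a commonly used XPath 1.0 idiom pads the normalized class attribute with spaces and looks for the space-delimited token. The sketch below uses made-up markup (the snippet variable and the class exampleclass-numbertwo are assumptions for illustration) to compare the two expressions:

```python
from lxml import html

# Hypothetical markup: the second div has a class that merely starts with "exampleclass".
snippet = """
<html><body>
<div class="otherclass exampleclass">some</div>
<div class="exampleclass-numbertwo">things</div>
<div>left</div>
</body></html>"""

tree = html.fromstring(snippet)

# Substring match: also catches "exampleclass-numbertwo".
loose = tree.xpath("//div[contains(@class, 'exampleclass')]")

# Token match: pad the normalized attribute with spaces and look for " exampleclass ".
strict = tree.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' exampleclass ')]"
)

loose_texts = [el.text for el in loose]
strict_texts = [el.text for el in strict]
print(loose_texts)
print(strict_texts)
```

The substring version selects both classed divs, while the padded version selects only the div whose class list contains exampleclass as a whole name.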

Nice, thanks. I can't use selectors though because I need `div`s with and without the class in the loop, updated sample code to hopefully make that clearer. `xpath` `contains` would be problematic in cases where the class `exampleclass-numbertwo` exists, see http://stackoverflow.com/a/1604480/188159 — qubodup, Sep 20 '16 at 10:10
@qubodup, yep, that was why I added *Depending on the case*. Are you looking for more than one class or just that single class? — Padraic Cunningham, Sep 20 '16 at 21:49

score 0 · Answer 2 · answered Sep 19 '16 at 15:18

You can elegantly use the membership test operator in:

for item in tree.xpath( "//div" ):
  if "exampleclass" in iter(item.classes):
    print("foo")

For user-defined classes which do not define __contains__() but do define __iter__(), x in y is true if some value z with x == z is produced while iterating over y.
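
The fallback described in that quote can be illustrated with two toy containers. OnlyIter and WithContains are hypothetical names invented for this sketch and have nothing to do with lxml's own implementation:

```python
class OnlyIter:
    """Hypothetical container that defines __iter__ but not __contains__."""

    def __init__(self, names):
        self._names = tuple(names)
        self.steps = 0  # how many items a membership test pulled via iteration

    def __iter__(self):
        for name in self._names:
            self.steps += 1
            yield name


class WithContains(OnlyIter):
    """Same container, but membership is answered directly."""

    def __init__(self, names):
        super().__init__(names)
        self._lookup = frozenset(self._names)

    def __contains__(self, name):
        return name in self._lookup


a = OnlyIter(["otherclass", "exampleclass"])
b = WithContains(["otherclass", "exampleclass"])

found_a = "exampleclass" in a  # falls back to iterating until a match
found_b = "exampleclass" in b  # calls __contains__, never iterates

print(found_a, a.steps)
print(found_b, b.steps)
```

Both tests succeed, but only the first one has to walk the items to find the match.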
