I'm trying to identify DOM elements by class name, but pattern.web does not behave as described in the docs. I'm running code that I've used before, so this did work at some point.

from pattern.web import DOM

html = """<html><head><title>pattern.web | CLiPS</title></head>
<body>
  <div class="class1 class2 class3">
    <form action="/pages/pattern-web"  accept-charset="UTF-8" method="post" id="search-block-form">
      <div>
        <label for="edit-search-block-form-1">Search this site: </label>
      </div>
    </form>
  </div>
</body></html>"""

dom = DOM(html)
print "Search Results by Method:"
print 'tag[attr="value"] Notation Results:'
print dom('div[class="class1 class2 class3"]')
print 
print 'tag.class Notation Results:'
print dom('div.class1')
print
print 'By class, no tag results:'
print dom.by_class('class1')
print 
print 'Looping through all divs and printing matching results:'
for i in dom('div'):
    if 'class' in i.attrs and i.attrs['class'] == 'class1 class2 class3':
        print i.attrs

Note that the Element and DOM functions are interchangeable and give the same results. The result is this:

Search Results by Method:
tag[attr="value"] Notation Results:
[]

tag.class Notation Results:
[]

By class, no tag results:
[Element(tag='div')]

Looping through all divs and printing matching results:
{u'class': u'class1 class2 class3'}

As you can see, looking it up using the tag.class notation and the tag[attr="value"] notation both give empty results, but by_class returns one result. Clearly elements with those attributes exist. How do I search for all the divs that have all 3 classes?

In the past, I've been able to search using dom('div.class1.class2.class3') to identify a div with all 3 classes. Not only does this no longer work, it also raises a unicode error (it appears that the second period triggers it): TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'

A User
  • I tried that actually, they behave the same / did not make a difference. I'm starting to think the latest version of pattern.web does not support search by multiple class names. – A User Sep 25 '18 at 17:37
  • Real data would be uniquely identifying, I can't do that @stovfl. What I've shared is real data just replaced to be anonymous. – A User Sep 25 '18 at 17:54
  • Thanks for your help @stovfl. I updated the code and expected results, and clarified one of the things that was causing an error. – A User Sep 25 '18 at 18:23
  • That would only work in this example. If I had `
    ` I'd only want the second one. With broader-than-desired criteria, if there's more than one match I can't be sure that it's what I'm looking for.
    – A User Sep 25 '18 at 18:51
  • I assume you have tried `dom.by_class('class1 class2 class3')` already. You want to use **CSS selector**, according to the doc you have to do 1. `element = Element(html)` 2. `element('div[class="class1 class2 class3"]')`. I'm off! – stovfl Sep 25 '18 at 19:23
  • Yep, tried all. – A User Sep 25 '18 at 20:56

1 Answer

Question: In the past, I've been able to search using dom('div.class1.class2.class3') to identify a div with all 3 classes.


Reading the source at github.com/clips/pattern/blob/master/pattern/web,
I found that pattern.web is only a wrapper around Beautiful Soup:

# Beautiful Soup is wrapped in DOM, Element and Text classes, resembling the Javascript DOM.
# Beautiful Soup can also be used directly


This is a known issue, see the SO question "Beautiful soup find_all doesn't find CSS selector with multiple classes".

The workaround is to use .select(...) instead of .find_all(...),
but I didn't find .select(...) in pattern.web.

For example:

from bs4 import BeautifulSoup

html = """<html><head><title>pattern.web | CLiPS</title></head>
  <body>
    <div class="class1 class4">
      <form action="/pages/pattern-web"  accept-charset="UTF-8" method="post" id="search-block-form">
        <div class="class1 class2 class3">
          <label for="edit-search-block-form-1">Search this site: </label>
        </div>
      </form>
    </div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
div = soup.select('div.class1.class2')
print("{}".format(div))

Output:

[<div class="class1 class2 class3">
<label for="edit-search-block-form-1">Search this site: </label>
</div>]

Question: it's also giving me unicode errors (it appears that the second period causes a unicode error) :

TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'

It's unclear whether this TypeError comes from pattern.web or from Beautiful Soup.
According to the SO question descriptor-join-requires-a-unicode-object-but-received-a-str, it's a standard Python message.
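
To illustrate where this kind of message comes from: Python raises it when a method is called through the type (for example str.lower(x)) with an argument of a different type. In Python 2 that would be a unicode value passed to str.lower. The sketch below is only a Python 3 analogue using bytes, since Python 3 has no unicode type; call_unbound_lower is a hypothetical helper name, and the exact wording of the message differs between Python versions.

```python
def call_unbound_lower(value):
    """Call str.lower through the type, as a library might do internally."""
    try:
        # Works only when `value` really is a str instance.
        return str.lower(value)
    except TypeError as exc:
        # Raised by the method descriptor when `value` has another type.
        return "TypeError: {}".format(exc)


print(call_unbound_lower("ABC"))   # a str is accepted
print(call_unbound_lower(b"ABC"))  # bytes are rejected with a TypeError
```

So the error most likely means that somewhere a selector or attribute value of the "wrong" string type reached a call like this, which would fit it being version dependent.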


Using pattern.web from GitHub, the results are as expected:

from pattern.web import Element

elements = Element(html)
print("Search Results by Method:")
print('tag[attr="value"] Notation\tResults:{}'
    .format(elements('div[class="class1 class2 class3"]')))

print('tag.class Notation \t\t\tResults:{}'
    .format(elements('div.class1.class2.class3')))

print('By class, no tag \t\t\tResults:{}'
    .format(elements.by_class('class1 class2 class3')))

print('Looping through all divs and printing matching results:')
for i in elements('div'):
    if 'class' in i.attrs:
        if " ".join(i.attrs['class']) == 'class1 class2 class3':
            print("\tmatch:{}".format(i.attrs))

Output:

Search Results by Method:
tag[attr="value"] Notation  Results:{'class': ['class1', 'class2', 'class3']}
tag.class Notation          Results:{'class': ['class1', 'class2', 'class3']}
By class, no tag            Results:{'class': ['class1', 'class2', 'class3']}
Looping through all divs and printing matching results:
    match:{'class': ['class1', 'class2', 'class3']}

Tested with Python:3.5.3 - pattern.web:3.6 - bs4:4.5.3
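
The loop above compares the joined class string exactly, so it depends on the order of the classes. A minimal sketch of an order-independent check follows. has_all_classes is a hypothetical helper, not part of pattern.web; it accepts the class attribute either as one string (as in the question's output) or as a list (as in the output above).

```python
def has_all_classes(class_attr, required):
    """Return True if every name in `required` occurs in `class_attr`.

    `class_attr` may be a space-separated string or a list of class names.
    """
    if class_attr is None:
        return False
    if isinstance(class_attr, str):
        present = set(class_attr.split())
    else:
        present = set(class_attr)
    # Subset test: all required classes must be present, in any order.
    return set(required) <= present


wanted = ["class1", "class2", "class3"]
print(has_all_classes("class1 class2 class3", wanted))      # string form
print(has_all_classes(["class1", "class4"], wanted))        # list form
```

With pattern.web it could be applied inside the loop, for example `if 'class' in i.attrs and has_all_classes(i.attrs['class'], wanted)`, assuming i.attrs behaves like a dict as it does in the code above.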

stovfl
  • Very helpful thank you. I didn't know pattern.web was built on top of it, it was introduced to me as an alternative! – A User Sep 25 '18 at 20:58
  • @AUser: Updated my answer. Using `pattern.web` from GitHub, I got the results as expected. Your `TypeError` seems to be version dependent. – stovfl Sep 26 '18 at 14:36