
I'm trying to use a regex to match tags with class="calendar-days-list2" but not class="calendar-days-list2 prev-next-month". I loaded up a sample piece of HTML with tags containing both options.

When I search the sample HTML with re.findall(), the regex matches as I want. When I use the same regex in BeautifulSoup, it returns both the wanted and the unwanted class. Why does this happen? Any thoughts? Thanks!

html = '''<td id="pagestructure_0_pagecontent_0_calendar1_2016_1_7_0" class="calendar-days-list2" width="14%">
       <span class="date-number">7</span>
            <p>
              <img src="/wac/wacassets/images/icons/h1.gif" border="0">
              <a href="http://www.woodruffcenter.org/Commerce/MuseumAdmissions?performanceId=86514">Special Exhibitions</a>
              10:00 AM
            </p>

          <td id="pagestructure_0_pagecontent_0_calendar1_2015_11_29_1"    class="calendar-days-list2 prev-next-month" width="14%"></td>
       '''

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
# WORKS: the lookahead sees the whole attribute string
print(re.findall(r"(calendar\-days\-list2)(?!\sprev\-next\-month)", html))

regex = re.compile(r"(calendar\-days\-list2)(?!\sprev\-next\-month)")
# DOESN'T WORK: returns both <td> elements
tds = soup.find_all("td", {"class": regex})
print(tds)

output:

# re.findall                              
['calendar-days-list2'] 

# soup.find_all
[<td class="calendar-days-list2" id="pagestructure_0_pagecontent_0_calendar1_2016_1_7_0" width="14%">
<span class="date-number">7</span>
<p>
<img border="0" src="/wac/wacassets/images/icons/h1.gif"/>
<a href="http://www.woodruffcenter.org/Commerce/MuseumAdmissions?performanceId=86514">Special Exhibitions</a>
        10:00 AM
    </p>
</td>, <td class="calendar-days-list2 prev-next-month" id="pagestructure_0_pagecontent_0_calendar1_2015_11_29_1" width="14%"></td>]


  • what is your expected result? – Elixir Techne Dec 24 '15 at 06:55
  • to only return tags with the class="calendar-days-list2", such as the first element in the list returned by soup.find_all, and not tags with the class="calendar-days-list2 prev-next-month" like the second one it returns – Spencer Smolen Dec 24 '15 at 06:58

1 Answer

regex = re.compile(r"(calendar\-days\-list2)(?!\sprev\-next\-month)")
# DOESN'T WORK
tds = soup.find_all("td", {"class": regex})

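This does not work because the regular expression is applied to each class value separately, not to the entire attribute value. BeautifulSoup treats class as a special multi-valued attribute, so class="calendar-days-list2 prev-next-month" is stored as two values, and the lookahead never sees the text that follows the first one. Several recent posts cover related problems.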
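To make the difference concrete, here is a minimal sketch that uses only the re module. It imitates the per-value matching by splitting the attribute string on whitespace; the variable names are illustrative, and the split is a stand-in for the list of class values BeautifulSoup keeps, not its actual internals.

```python
import re

regex = re.compile(r"(calendar\-days\-list2)(?!\sprev\-next\-month)")

# The raw attribute string of the unwanted <td>
attribute_value = "calendar-days-list2 prev-next-month"

# Against the whole string, the negative lookahead sees " prev-next-month"
# and rejects the match. This is what re.findall() on the raw HTML does.
whole_match = regex.search(attribute_value) is not None

# Split into separate class values (assumption: this mimics the list that
# BeautifulSoup stores for a multi-valued attribute).
class_values = attribute_value.split()

# Tested one value at a time, "calendar-days-list2" has nothing after it,
# so the lookahead succeeds and the tag is reported as a match.
token_matches = [regex.search(value) is not None for value in class_values]

print(whole_match)    # False
print(token_matches)  # [True, False]
```

Because a single matching value is enough for the tag to be returned, the unwanted td passes the filter.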

Probably the simplest approach is to go with a CSS selector to make a full class attribute match:

soup.select('[class="calendar-days-list2"]')
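A small self-contained check of that selector, as a sketch: the shortened HTML, the ids "a" and "b", and the explicit "html.parser" argument are assumptions made for the example. The second query is an alternative not taken from the answer above; it compares the parsed class list directly with a filter function.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample with one wanted and one unwanted cell
html = """
<table><tr>
  <td id="a" class="calendar-days-list2">7</td>
  <td id="b" class="calendar-days-list2 prev-next-month"></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# The attribute selector compares the full class attribute value,
# so only the cell whose class is exactly "calendar-days-list2" is selected.
exact = soup.select('td[class="calendar-days-list2"]')
exact_ids = [td["id"] for td in exact]

# Alternative: a filter function that compares the list of class values.
by_list = soup.find_all(
    lambda tag: tag.name == "td" and tag.get("class") == ["calendar-days-list2"]
)
by_list_ids = [td["id"] for td in by_list]

print(exact_ids)    # ['a']
print(by_list_ids)  # ['a']
```

Note that the attribute selector is an exact string comparison, so it would also skip a cell that carries any additional class, not only prev-next-month.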
alecxe