I want to fetch specific rows in an HTML document

The rows have the following attributes set: bgcolor and valign.

Here is a snippet of the HTML table:

<table>
   <tbody>
      <tr bgcolor="#f01234" valign="top">
        <!--- td's follow ... -->
      </tr>
      <tr bgcolor="#c01234" valign="top">
        <!--- td's follow ... -->
      </tr>
   </tbody>
</table>

I've had a very quick look at BeautifulSoup's documentation. It's not clear what params to pass to findAll to match the rows I want.

Does anyone know what to pass to findAll() to match these rows?

skyeagle

2 Answers


Don't use regex to parse HTML. Use an HTML parser:

import lxml.html

doc = lxml.html.fromstring(your_html)
# Select <tr> elements with either bgcolor value and valign="top"
result = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
    "and @valign='top']")
print(result)

That will extract all matching tr elements from your HTML. You can then do further operations with them, such as changing text or attribute values, extracting data, or searching further.
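A short sketch of those follow-up operations, assuming lxml is installed; the sample table is adapted from the question, and the `class="matched"` attribute is just an illustrative change:

```python
import lxml.html

html = """
<table>
   <tbody>
      <tr bgcolor="#f01234" valign="top"><td>one</td></tr>
      <tr bgcolor="#c01234" valign="top"><td>two</td></tr>
      <tr bgcolor="#ffffff" valign="top"><td>other</td></tr>
   </tbody>
</table>
"""

doc = lxml.html.fromstring(html)
# Same xpath as above: rows with either bgcolor and valign="top"
rows = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
                 "and @valign='top']")
for tr in rows:
    tr.set("class", "matched")          # change an attribute
    print(tr.text_content().strip())    # extract the row's text
```

The same Element objects can also be serialized back out with lxml.html.tostring(doc) once modified.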

Obligatory link:

RegEx match open tags except XHTML self-contained tags

nosklo
  • Ok, now that you changed your question from regex to `BeautifulSoup`, my answer stands as an example to do with `lxml.html`. I think it is way better than `BeautifulSoup`. – nosklo Jan 11 '11 at 14:56
  • Oops. Modified question to use BeautifulSoup instead. I already use BS with some of my scripts but never heard of lxml module. Do you know what to pass to BS to fetch the rows (I'm a bit hesitant to install/learn yet another library) – skyeagle Jan 11 '11 at 14:58
  • @skyeagle: Well, I dumped `BeautifulSoup` long ago for `lxml.html` as it is way faster and **supports xpath!!!** so, this answer is better left here as a hint. – nosklo Jan 11 '11 at 15:04
  • Also, `BeautifulSoup` is unmaintained – nosklo Jan 11 '11 at 15:04
  • AFAIK, BeautifulSoup does better with bad HTML (at least, that's what it's designed for). I've not tried using xpath, but I like BeautifulSoup's easy `soup.body.h1` style navigation. – Thomas K Jan 11 '11 at 16:10
  • 1
    @Thomas K: `lxml.html` is better than `BeautifulSoup` for bad html. You can use that same style of navigation by using `lxml.objectify` but I don't recommend it, since just using xpath is easier and simpler. – nosklo Jan 11 '11 at 17:36
  • @nosklo: Yet the documentation for lxml.html specifically suggests BeautifulSoup as a parser for really bad html: http://codespeak.net/lxml/lxmlhtml.html#really-broken-pages (and I'd have to disagree that xpath is easier. More powerful, maybe.) – Thomas K Jan 12 '11 at 11:19

Something like

import re
# soup = BeautifulSoup(your_html)  # build the soup from your HTML first
soup.findAll('tr', attrs={'bgcolor': re.compile(r'#f01234|#c01234'), 'valign': 'top'})
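If installing another library is a concern (as mentioned in the comments above), the same filtering can also be sketched with only the standard library's html.parser; the RowFinder class name and sample HTML below are illustrative, not from the original answer:

```python
from html.parser import HTMLParser

# Sample table adapted from the question
HTML = """
<table>
   <tbody>
      <tr bgcolor="#f01234" valign="top"><td>one</td></tr>
      <tr bgcolor="#c01234" valign="top"><td>two</td></tr>
      <tr bgcolor="#ffffff" valign="top"><td>other</td></tr>
   </tbody>
</table>
"""

WANTED = {"#f01234", "#c01234"}

class RowFinder(HTMLParser):
    """Collect attribute dicts of <tr> tags matching the wanted colors."""
    def __init__(self):
        super().__init__()
        self.matched = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "tr" and a.get("bgcolor") in WANTED and a.get("valign") == "top":
            self.matched.append(a)

finder = RowFinder()
finder.feed(HTML)
print(len(finder.matched))  # 2
```

This only records the tags' attributes, not the subtree, so for anything beyond simple matching a real tree-building parser like BeautifulSoup or lxml is the better choice.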

GabiMe