I want to fetch specific rows in an HTML document

The rows have the following attributes set: bgcolor and valign.

Here is a snippet of the HTML table:

<table>
   <tbody>
      <tr bgcolor="#f01234" valign="top">
        <!--- td's follow ... -->
      </tr>
      <tr bgcolor="#c01234" valign="top">
        <!--- td's follow ... -->
      </tr>
   </tbody>
</table>

I've had a very quick look at BeautifulSoup's documentation. It's not clear what params to pass to findAll to match the rows I want.

Does anyone know what to pass to findAll() to match these rows?

skyeagle

2 Answers


Don't use regex to parse HTML. Use an HTML parser:

import lxml.html

doc = lxml.html.fromstring(your_html)
# Select <tr> elements with either bgcolor value and valign="top"
result = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
    "and @valign='top']")
print(result)

That will extract all matching tr elements from your HTML. You can then do further operations with them, such as changing text or attribute values, extracting data, or searching further.
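A short sketch of those follow-up operations, assuming lxml is installed; the sample table is adapted from the question, and the `class="matched"` attribute is just an illustrative change:

```python
import lxml.html

html = """
<table>
   <tbody>
      <tr bgcolor="#f01234" valign="top"><td>one</td></tr>
      <tr bgcolor="#c01234" valign="top"><td>two</td></tr>
      <tr bgcolor="#ffffff" valign="top"><td>other</td></tr>
   </tbody>
</table>
"""

doc = lxml.html.fromstring(html)
# Same xpath as above: rows with either bgcolor and valign="top"
rows = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
                 "and @valign='top']")
for tr in rows:
    tr.set("class", "matched")          # change an attribute
    print(tr.text_content().strip())    # extract the row's text
```

The same Element objects can also be serialized back out with lxml.html.tostring(doc) once modified.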

Obligatory link:

RegEx match open tags except XHTML self-contained tags

nosklo
  • Ok, now that you changed your question from regex to `BeautifulSoup`, my answer stands as an example to do with `lxml.html`. I think it is way better than `BeautifulSoup`. – nosklo Jan 11 '11 at 14:56
  • Oops. Modified question to use BeautifulSoup instead. I already use BS with some of my scripts but never heard of lxml module. Do you know what to pass to BS to fetch the rows (I'm a bit hesitant to install/learn yet another library) – skyeagle Jan 11 '11 at 14:58
  • @skyeagle: Well, I dumped `BeautifulSoup` long ago for `lxml.html` as it is way faster and **supports xpath!!!** so, this answer is better left here as a hint. – nosklo Jan 11 '11 at 15:04
  • Also, `BeautifulSoup` is unmaintained – nosklo Jan 11 '11 at 15:04
  • AFAIK, BeautifulSoup does better with bad HTML (at least, that's what it's designed for). I've not tried using xpath, but I like BeautifulSoup's easy `soup.body.h1` style navigation. – Thomas K Jan 11 '11 at 16:10
  • 1
    @Thomas K: `lxml.html` is better than `BeautifulSoup` for bad html. You can use that same style of navigation by using `lxml.objectify` but I don't recommend it, since just using xpath is easier and simpler. – nosklo Jan 11 '11 at 17:36
  • @nosklo: Yet the documentation for lxml.html specifically suggests BeautifulSoup as a parser for really bad html: http://codespeak.net/lxml/lxmlhtml.html#really-broken-pages (and I'd have to disagree that xpath is easier. More powerful, maybe.) – Thomas K Jan 12 '11 at 11:19

Something like

import re
# soup = BeautifulSoup(your_html)  # build the soup from your HTML first
soup.findAll('tr', attrs={'bgcolor': re.compile(r'#f01234|#c01234'), 'valign': 'top'})
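If installing another library is a concern (as mentioned in the comments above), the same filtering can also be sketched with only the standard library's html.parser; the RowFinder class name and sample HTML below are illustrative, not from the original answer:

```python
from html.parser import HTMLParser

# Sample table adapted from the question
HTML = """
<table>
   <tbody>
      <tr bgcolor="#f01234" valign="top"><td>one</td></tr>
      <tr bgcolor="#c01234" valign="top"><td>two</td></tr>
      <tr bgcolor="#ffffff" valign="top"><td>other</td></tr>
   </tbody>
</table>
"""

WANTED = {"#f01234", "#c01234"}

class RowFinder(HTMLParser):
    """Collect attribute dicts of <tr> tags matching the wanted colors."""
    def __init__(self):
        super().__init__()
        self.matched = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "tr" and a.get("bgcolor") in WANTED and a.get("valign") == "top":
            self.matched.append(a)

finder = RowFinder()
finder.feed(HTML)
print(len(finder.matched))  # 2
```

This only records the tags' attributes, not the subtree, so for anything beyond simple matching a real tree-building parser like BeautifulSoup or lxml is the better choice.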

GabiMe