Simple example BeautifulSoup Python

Question

I was working through a simple example with BeautifulSoup, but I was getting weird results.

Here is my code:

soup = BeautifulSoup(page)
print soup.prettify()
stuff = soup.findAll('td', attrs={'class' : 'prodSpecAtribtue'})
print stuff

When I print I get:

[]

I'm not sure what's happening, because when I printed soup on the screen I got the proper data. Basically, I am searching for the values found in <td> tags with the class prodSpecAtribtue.

No, you would either get `[]` or a list with matches. You would **not** get `{}`. — Martijn Pieters, May 21 '13 at 21:14
Can you show us some sample HTML snippet that still produces this result? — Martijn Pieters, May 21 '13 at 21:15
You do realize that you misspelled `prodSpecAtribtue`, right? I'd expect it to be spelled `prodSpecAttribute` instead. — Martijn Pieters, May 21 '13 at 21:16
http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1 — James Hallen, May 21 '13 at 21:19
There is **no** `prodSpecAttribute` *or* `prodSpecAtribtue` class anywhere in that document. There is not even a `<td>`. — Martijn Pieters, May 21 '13 at 21:20
When I print the result of `soup.findAll('td', attrs={'class' : 'prodSpecAtribtue'})` I get `[]` (not `{}`), the empty list, which is expected as there are no `<td>` elements anywhere on the page you linked to. — Martijn Pieters, May 21 '13 at 21:27
The page uses AJAX queries to fill the table dynamically. Use your browser developer tools to detect what URLs are being requested asynchronously, then load *those*. — Martijn Pieters, May 21 '13 at 21:33
@MartijnPieters I'm really stupid, I gave the wrong link. This is the correct one: http://www.cmegroup.com/trading/interest-rates/stir/eurodollar_contract_specifications.html — James Hallen, May 21 '13 at 21:39

score 1 · Accepted Answer · answered May 21 '13 at 21:43

You misspelled the class name:

soup.findAll('td', attrs={'class': 'prodSpecAtribute'})

works fine. That's prodSpecAtribute, not prodSpecAtribtue. That's still misspelled, but slightly less so.
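
As a minimal sketch of why the typo yields an empty list rather than an error, assuming BeautifulSoup 4 (`bs4`, where `find_all` is the current spelling of `findAll`) and a made-up HTML fragment standing in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup is not reproduced here.
page = """
<table>
  <tr>
    <td class="prodSpecAtribute">Contract Size</td>
    <td>some value</td>
  </tr>
</table>
"""

soup = BeautifulSoup(page, "html.parser")

# The class name must match the markup exactly. A typo is not an error;
# it simply matches nothing, so the result is an empty list.
misspelled = soup.find_all("td", attrs={"class": "prodSpecAtribtue"})
matched = soup.find_all("td", attrs={"class": "prodSpecAtribute"})

labels = [td.get_text() for td in matched]
print(misspelled)
print(labels)
```

Copying the class name straight from the page source, rather than retyping it, avoids this kind of silent mismatch.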

Martijn Pieters

Do you know an efficient way of using BeautifulSoup to extract data of the format `<td colspan="4">` and `<td colspan="5">`? I figured regex is the simplest solution. – James Hallen May 21 '13 at 21:54
@JamesHallen: Pick out all `td` with a colspan attribute: `.findAll('td', colspan=True)`, pick out all `td` with a colspan attribute with values 4 *or* 5: `.findAll('td', colspan=['4', '5'])` – Martijn Pieters May 21 '13 at 21:55
I'm having a bit of trouble. Using the same idea, I should be able to get all the data inside the `<td>` tags using `soup = BeautifulSoup(page)` `stuff = soup.findAll('td', colspan = ['4', '5'])`? Or am I missing something here? – James Hallen May 21 '13 at 22:26
@JamesHallen: That `.findAll()` returns all matching `td` elements. To get the text data *inside* of those elements, use `.string` on each element. – Martijn Pieters May 21 '13 at 22:28
I use something like: `elements = stuff.getText()`, and I use a regex expression to get rid of some unwanted elements: `elements = re.sub('&\w+;', '', elements)`, do you think this is a good idea? – James Hallen May 21 '13 at 22:36
@JamesHallen: For BeautifulSoup 3, you want to use `.findAll(text=True)` to find all text instead, perhaps joining the result with `' '.join(stuff.findAll(text=True))`. – Martijn Pieters May 21 '13 at 22:45
@JamesHallen: BeautifulSoup 3 is not that great at replacing `&...;` html entities with their proper Unicode codepoints. Upgrade to BeautifulSoup 4, or [convert them manually afterwards](http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python). – Martijn Pieters May 21 '13 at 22:49
I can't use `.findAll(text=True)` along with `.findAll('td', colspan = [...])`, right? – James Hallen May 21 '13 at 23:12
@JamesHallen: Sure you can. Loop over the elements from the `findAll('td'..)`, each element you find you can call `.findAll(text=True)` on: `for cell in soup.findAll('td', colspan = ['4', '5']): print ' '.join(cell.findAll(text=True))`. – Martijn Pieters May 21 '13 at 23:14
If you don't mind, I don't really understand how your `' '.join(cell.findAll())` works. We use `' '.join` to append strings, so what are we appending here? – James Hallen May 21 '13 at 23:26
@JamesHallen: `cell.findAll(text=True)` returns a sequence of text. `' '.join()` joins that sequence into one long string. – Martijn Pieters May 21 '13 at 23:29
so something like `"this" "is" "a" "string"` will become `"this is a string"` – James Hallen May 21 '13 at 23:48
@JamesHallen: Something like `this is a HTML-formatted string` becomes `"this is a HTML-formatted string"`; the `.findAll(text=True)` call finds all text elements, recursively. – Martijn Pieters May 22 '13 at 06:46
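
The approach worked out in the comments above can be sketched roughly as follows. This assumes BeautifulSoup 4 rather than the BeautifulSoup 3 discussed in the thread (so `find_all(string=True)` stands in for `findAll(text=True)`), and the HTML fragment is invented for illustration. Each text node is stripped before joining so that the single-space separator does not double up existing whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with nested tags and an HTML entity.
page = """
<table>
  <tr><td colspan="4">this <b>is</b> a &amp; <i>string</i></td></tr>
  <tr><td colspan="5">second cell</td></tr>
  <tr><td colspan="2">ignored</td></tr>
  <tr><td>no colspan</td></tr>
</table>
"""

soup = BeautifulSoup(page, "html.parser")

# colspan=True matches any <td> that has a colspan attribute at all.
with_colspan = soup.find_all("td", colspan=True)

rows = []
# A list value matches cells whose colspan equals any one of the listed strings.
for cell in soup.find_all("td", colspan=["4", "5"]):
    # string=True collects every text node under the cell, recursively.
    pieces = cell.find_all(string=True)
    rows.append(" ".join(piece.strip() for piece in pieces))

print(rows)
```

With BeautifulSoup 4 the `&amp;` entity comes back already decoded, which is the reason the thread recommends upgrading instead of stripping entities with a regex.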
