
I was working through a simple example with BeautifulSoup, but I was getting weird results.

Here is my code:

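from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

# page holds the HTML source that was fetched earlier
soup = BeautifulSoup(page)
print soup.prettify()
stuff = soup.findAll('td', attrs={'class' : 'prodSpecAtribtue'})
print stuff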

When I print I get:

[]

I'm not sure what's happening, because when I printed soup to the screen I got the proper data. Basically, I am searching for the values found in <td> tags with the class prodSpecAtribtue.

James Hallen

1 Answer


You misspelled the class name:

soup.findAll('td', attrs={'class': 'prodSpecAtribute'})

works fine. That's prodSpecAtribute, not prodSpecAtribtue. That's still misspelled, but slightly less so.
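
To see the difference for yourself, here is a minimal sketch. It assumes BeautifulSoup 4 (the `bs4` package, where `findAll` is spelled `find_all`) rather than the BeautifulSoup 3 used in the question, and the sample markup is made up, since the real page is not shown:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page is not part of the question.
page = """
<table>
  <tr><td class="prodSpecAtribute">Weight</td><td>2 kg</td></tr>
  <tr><td class="prodSpecAtribute">Colour</td><td>Black</td></tr>
</table>
"""

soup = BeautifulSoup(page, "html.parser")

# The class name as typed in the question: no element carries it.
misspelled = soup.find_all("td", attrs={"class": "prodSpecAtribtue"})

# The class name as it appears in the markup.
matched = soup.find_all("td", attrs={"class": "prodSpecAtribute"})
labels = [cell.get_text() for cell in matched]

print(misspelled)
print(labels)
```

An empty result with no error is the expected behaviour when nothing matches, so comparing the search string character by character against the `prettify()` output is usually the first thing to check.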

Martijn Pieters
  • Do you know an efficient way of using BeautifulSoup to extract data of the format: and . I figured regex is the simplest solution. – James Hallen May 21 '13 at 21:54
  • @JamesHallen: Pick out all `td` with a colspan attribute: `.findAll('td', colspan=True)`, pick out all `td` with a colspan attribute with values 4 *or* 5: `.findAll('td', colspan=['4', '5'])` – Martijn Pieters May 21 '13 at 21:55
  • I'm having a bit of trouble. Using the same idea, I should be able to get all the data inside the tags `` using `soup = BeautifulSoup(page)` `stuff = soup.findAll('td', colspan = ['4', '5'])`? Or am I missing something here? – James Hallen May 21 '13 at 22:26
  • @JamesHallen: That `.findAll()` returns all matching `td` elements. To get the text data *inside* of those elements, use `.string` on each element. – Martijn Pieters May 21 '13 at 22:28
  • I use something like: `elements = stuff.getText()`, and I use a regular expression to get rid of some unwanted elements: `elements = re.sub('&\w+;', '', elements)`. Do you think this is a good idea? – James Hallen May 21 '13 at 22:36
  • @JamesHallen: For BeautifulSoup 3, you want to use `.findAll(text=True)` to find all text instead, perhaps joining the result with `' '.join(stuff.findAll(text=True))`. – Martijn Pieters May 21 '13 at 22:45
  • @JamesHallen: BeautifulSoup 3 is not that great at replacing `&...;` html entities with their proper Unicode codepoints. Upgrade to BeautifulSoup 4, or [convert them manually afterwards](http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python). – Martijn Pieters May 21 '13 at 22:49
  • I can't use `.findAll(text=True)` along with `.findAll('td', colspan = [...])`, right? – James Hallen May 21 '13 at 23:12
  • @JamesHallen: Sure you can. Loop over the elements from the `findAll('td'..)`, and on each element you find you can call `.findAll(text=True)`: `for cell in soup.findAll('td', colspan = ['4', '5']): print ' '.join(cell.findAll(text=True))`. – Martijn Pieters May 21 '13 at 23:14
  • If you don't mind, I don't really understand how your `' '.join(cell.findAll())` works. We use `' '.join` to append strings, so what are we appending here? – James Hallen May 21 '13 at 23:26
  • @JamesHallen: `cell.findAll(text=True)` returns a sequence of text. `' '.join()` joins that sequence into one long string. – Martijn Pieters May 21 '13 at 23:29
  • So something like `"this" "is" "a" "string"` will become `"this is a string"`? – James Hallen May 21 '13 at 23:48
  • @JamesHallen: Something like `this is a HTML-formatted string` becomes `"this is a HTML-formatted string"`; the `.findAll(text=True)` call finds all text elements, recursively. – Martijn Pieters May 22 '13 at 06:46
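
The loop discussed in the comments can be sketched roughly as follows. This again assumes BeautifulSoup 4 (`bs4`), where `findAll(text=True)` is written `find_all(string=True)`; the table markup and the names `rows` and `pieces` are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: cells with colspan 4 or 5 hold the data of interest.
page = """
<table>
  <tr><td colspan="4">Battery <b>life</b></td></tr>
  <tr><td colspan="5">10 hours &amp; up</td></tr>
  <tr><td colspan="2">ignored</td></tr>
  <tr><td>no colspan</td></tr>
</table>
"""

soup = BeautifulSoup(page, "html.parser")

# colspan=True matches any td that has a colspan attribute at all.
with_colspan = soup.find_all("td", colspan=True)

rows = []
# A list of values matches a td whose colspan is "4" or "5".
for cell in soup.find_all("td", colspan=["4", "5"]):
    # All text nodes inside the cell, found recursively (including nested tags).
    pieces = cell.find_all(string=True)
    # Join the non-empty pieces into one string per cell.
    rows.append(" ".join(p.strip() for p in pieces if p.strip()))

print(rows)
```

Stripping each piece before joining is a choice made here to avoid doubled spaces; the one-liner in the comments joins the raw pieces as they are. BeautifulSoup 4 also decodes entities such as `&amp;` while parsing, so no regex clean-up is needed in this sketch.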