
I have been trying to parse text elements stored in between <td> tags, for example:

<tr>
<td>Trading Hours</td>
<td><b>Monday</b> <br />
London - 23:00 Sunday - 23:00 Monday<br />
New York - 18:00 Sunday - 18:00 Monday<br />
Chicago - 17:00 Sunday - 17:00 Monday<br />
<br />
<b>Tuesday-Friday</b> <br />
London - 01:00 - 23:00<br />
New York - 20:00 - 18:00<br />
Chicago - 19:00 - 17:00<br />
</td>
</tr>

In this simple example, there are only 2 <td> tags. Suppose a variable tr stores the entire block of HTML code. My logic for extracting the text (without any <tr> or <br> tags) is as follows:

for td in tr.findAll('td'):
    row.append((td.find('td', text = True)).strip().strip('\n'))

Problem: My for loop recognizes the first <td> tag, but not the second. How can I improve this?

Max Kim
  • possible duplicate of [Parsing HTML Python](http://stackoverflow.com/questions/11709079/parsing-html-python). Clarify if I am wrong. – Johan Lundberg Jun 16 '13 at 19:05

1 Answer


text=True is a search filter: it tells BeautifulSoup to look for elements with text, not to return that text. To get the text itself, use .get_text():

td.find('td', text=True).get_text(strip=True)
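
For reference, here is a minimal sketch of the whole loop. It assumes BeautifulSoup 4 (imported as `bs4`) and Python's built-in `html.parser`, and it uses a shortened copy of the HTML from the question. As the comments note, BeautifulSoup 3 behaves differently, so this may not carry over as-is. It reads the text from each `td` directly instead of searching for a nested `td` inside it; the ` | ` separator is only an illustrative choice.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question.
html = """
<tr>
<td>Trading Hours</td>
<td><b>Monday</b> <br />
London - 23:00 Sunday - 23:00 Monday<br />
New York - 18:00 Sunday - 18:00 Monday<br />
</td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")
tr = soup.find("tr")

row = []
for td in tr.find_all("td"):
    # stripped_strings yields every text fragment inside the cell,
    # with surrounding whitespace removed and empty fragments skipped.
    row.append(" | ".join(td.stripped_strings))

for cell in row:
    print(cell)
```

If a single string per cell without separators is enough, `td.get_text(strip=True)` is the shorter alternative.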
Blender
  • Even before getting the text, when I do: `for td in tr.findAll('td'): print td`, it only prints the first `<td> ... </td>` tag and not the second. I was trying to figure out why that happens. – Max Kim Jun 16 '13 at 19:16
  • @MaxKim: Where are you getting this HTML? It's probably malformed. – Blender Jun 16 '13 at 19:17
  • @MaxKim, what version of BeautifulSoup are you using? I used the same HTML that you posted and it found both td elements. The HTML you posted looks well-formed to me. – Justin Peel Jun 16 '13 at 19:18
  • I'm using BeautifulSoup 3, and I get an error with `.get_text()`: `'NavigableString' object has no attribute 'get_text'` – Max Kim Jun 16 '13 at 19:20