How to elegantly get the top-level text of an HTML td with BeautifulSoup4?

Question

Below is a simple HTML fragment that I parse with BeautifulSoup4. I want to extract only the top-level raw text, hello, without the text inside the nested script tag.

mysoup = BeautifulSoup('<td>hello<script type="text/javascript">world</script></td>')

I've tried several intuitive approaches, but none of them gives the expected result:

mysoup.text            # u'helloworld'
mysoup.contents        # [<html><body><td>hello<script type="text/javascript">world</script></td></body></html>]
list(mysoup.strings)   # [u'hello', u'world']

How can I get just the top-level text?

score 0 · Accepted Answer · edited May 23 '17 at 12:06

First, get a reference to the td node. Then, iterate through its children and see which of them are strings:

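from bs4 import BeautifulSoup, NavigableString

mysoup = BeautifulSoup('<td>hello<script type="text/javascript">world</script></td>')
td = mysoup.find('td')
# Keep only the direct children that are text nodes; nested tags such as <script> are skipped.
print([s for s in td.children if isinstance(s, NavigableString)])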
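
The same idea can be wrapped in a small reusable function. This is only a sketch under a few assumptions: Python 3, the built-in "html.parser" backend (other parsers may treat a bare td differently), and a helper name, top_level_text, that is made up for illustration and is not part of BeautifulSoup. It also skips HTML comments, since bs4's Comment is a subclass of NavigableString.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def top_level_text(tag):
    """Join the text nodes that are direct children of `tag`.

    Hypothetical helper, not a BeautifulSoup API.
    """
    parts = [
        str(child)
        for child in tag.children
        # Direct text children only; Comment is excluded because it
        # subclasses NavigableString.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    return "".join(parts)


html = '<td>hello<script type="text/javascript">world</script></td>'
# Assumption: the built-in "html.parser" backend is used.
soup = BeautifulSoup(html, "html.parser")
print(top_level_text(soup.find("td")))
```

Returning a joined string rather than a list is a design choice; if the individual pieces matter, return parts instead.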

answered Apr 16 '15 at 08:03

Cristian Lupascu
