How to find all text inside <p> elements in an HTML page using BeautifulSoup

Question

I need to find all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.

P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?

Check out http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string for removing unicode in Python. — silent1mezzo, Apr 11 '12 at 21:00

score 14 · Answer 1 · answered Apr 11 '12 at 20:52

soup.findAll('p')

here is a reference
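
Note that findAll('p') returns the tag objects rather than their text. A minimal sketch of getting from there to the plain text, assuming the bs4 package (BeautifulSoup 4, where find_all is the current spelling of findAll) and Python 3, where strings are Unicode by default. The html sample, the Hindi paragraph, and the variable names are illustrative assumptions, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus a made-up Hindi paragraph (assumption)
html = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम एक <b>फल</b> है।</p>'
)

# html.parser is the standard-library backend, so no extra parser is needed
soup = BeautifulSoup(html, 'html.parser')

# get_text() joins every text node under the tag and drops the markup
paragraphs = [p.get_text() for p in soup.find_all('p')]

for text in paragraphs:
    print(text)
```

Each printed line is one paragraph with the nested link or bold markup removed and the Hindi characters left intact. When reading from a file, opening it with an explicit encoding such as encoding='utf-8' should avoid most decoding surprises.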

answered Apr 11 '12 at 20:52

0x90

score 6 · Accepted Answer · edited May 23 '17 at 10:32

Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the removed tags.

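from bs4 import BeautifulSoup  # BeautifulSoup 4 package

VALID_TAGS = ['div', 'p']

# Example input taken from the question
value = '<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'

soup = BeautifulSoup(value, 'html.parser')

# True matches every tag, so tags outside VALID_TAGS are actually visited
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        tag.unwrap()  # remove the tag but keep its contents in place

print(str(soup))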

Reference

edited May 23 '17 at 10:32

Community

answered Apr 11 '12 at 20:56

silent1mezzo
