3

I need to extract all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.

P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?

John Y
rarora7777
    Check out http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string for removing unicode in Python. – silent1mezzo Apr 11 '12 at 21:00

2 Answers

14
soup.findAll('p')

here is a reference
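To get from the matched paragraphs to the text asked for in the question, each tag's text can be pulled out with get_text(). A minimal sketch, assuming the bs4 package (BeautifulSoup 4) and Python 3 rather than the older BeautifulSoup 3 API; the helper name paragraph_texts and the sample string are made up for illustration:

```python
from bs4 import BeautifulSoup

def paragraph_texts(html):
    """Return the text of every <p> element, with nested tags stripped."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates all text nodes under the tag, so the
    # <a> element disappears but its link text stays.
    return [p.get_text() for p in soup.find_all("p")]

html = ('<p>Many hundreds of named mango '
        '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>')
print(paragraph_texts(html))
```

In Python 3 the returned strings are Unicode, so Hindi text should come through unchanged, provided the file is opened with its actual encoding (for example open(path, encoding="utf-8"), if the files are UTF-8).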

0x90
6

Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the removed tags.

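from bs4 import BeautifulSoup  # BeautifulSoup 4; the BeautifulSoup 3 module is Python 2 only

VALID_TAGS = ['div', 'p']

# value is the HTML source as a string
soup = BeautifulSoup(value, 'html.parser')

# find_all(True) matches every tag; findAll('p') would only visit
# <p> tags, which are all valid, so nothing would be removed
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        # unwrap() drops the tag but keeps its children in place
        tag.unwrap()

print(soup)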
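The same idea can be wrapped in a function and tried on a small sample that includes Hindi text. This is only a sketch, assuming BeautifulSoup 4 (bs4) and Python 3; strip_tags and the sample string are illustrative names, not part of the library:

```python
from bs4 import BeautifulSoup

VALID_TAGS = ["div", "p"]

def strip_tags(html, valid_tags=VALID_TAGS):
    """Drop every tag not in valid_tags but keep the text inside it."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        if tag.name not in valid_tags:
            tag.unwrap()  # replace the tag with its own children
    return str(soup)

sample = '<div><p>आम एक <a href="/wiki/Fruit">फल</a> है।</p></div>'
print(strip_tags(sample))
```

The markup for div and p is kept in the result; if only the text is wanted, calling get_text() on each remaining p tag would be the next step.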

Reference

Community
silent1mezzo