
I am using Python 2.7.6, urllib2, and BeautifulSoup to extract HTML from a website and store it in a variable.

How can I show just the HTML contents of a div with a given id using BeautifulSoup?

<div id='theDiv'>
<p>div content</p>
<p>div stuff</p>
<p>div thing</p>
</div>

would be

<p>div content</p>
<p>div stuff</p>
<p>div thing</p>
alecxe
user8028

2 Answers


Join the elements of the div tag's .contents list:

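from bs4 import BeautifulSoup

data = """
<div id='theDiv'>
    <p>div content</p>
    <p>div stuff</p>
    <p>div thing</p>
</div>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, 'html.parser')
# Look up the div by its id attribute
div = soup.find('div', id='theDiv')
# Convert each child (tag or text node) to a string and join them
print(''.join(map(str, div.contents)))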

Prints:

<p>div content</p>
<p>div stuff</p>
<p>div thing</p>
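To see why joining `.contents` works, it may help to look at what the list holds. The following is a minimal sketch, assuming Python 3 and the built-in `html.parser`; the sample markup and the variable names `kinds` and `inner_html` are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a div whose children are a tag, a bare text node, and another tag
html = "<div id='theDiv'><p>div content</p>text<p>div thing</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="theDiv")

# Each direct child is either a Tag or a NavigableString
kinds = [type(child).__name__ for child in div.contents]

# str() turns every child into its markup or text, so joining yields the inner HTML
inner_html = "".join(str(child) for child in div.contents)

print(kinds)
print(inner_html)
```

The text node between the two paragraphs is kept because `.contents` lists every direct child, not only tags.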
alecxe
  • That appears to work! Can you explain what is going on with `print ''.join(map(str, div.contents))`? – user8028 Sep 02 '14 at 03:37
  • @user8028 Sure. `contents` is a list of the tag's direct children, each of which is either a string or a `Tag` class instance. Applying `map(str, ...)` converts every child to a string so that they can be joined. Hope that helps. – alecxe Sep 02 '14 at 03:38
  • I have a special character (€) in the content of the div. How can I encode this to ASCII so it is printable to a terminal or writable to a file? I always receive the error `UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 31: ordinal not in range(128)` – Burcardo May 03 '18 at 11:57
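The encoding error in the last comment arises because the euro sign has no ASCII representation. One possible way around it, sketched below under stated assumptions, is to encode the text explicitly: as UTF-8, or as ASCII with numeric character references via the standard `xmlcharrefreplace` error handler. The string `inner_html` and the temporary file path are hypothetical stand-ins for the extracted div contents and the real output file.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the inner HTML extracted from the div
inner_html = u"<p>price: \u20ac5</p>"

# Option 1: encode to UTF-8 bytes, which can represent the euro sign directly
utf8_bytes = inner_html.encode("utf-8")

# Option 2: stay within ASCII by replacing non-ASCII characters
# with numeric character references, which browsers render correctly
ascii_bytes = inner_html.encode("ascii", "xmlcharrefreplace")

# Writing to a file: open it with an explicit encoding and write the text as-is
path = os.path.join(tempfile.mkdtemp(), "div.html")
with io.open(path, "w", encoding="utf-8") as fh:
    fh.write(inner_html)

# Read the file back to confirm the character survived the round trip
with io.open(path, encoding="utf-8") as fh:
    round_trip = fh.read()

print(ascii_bytes)
```

Whether UTF-8 or character references is the better fit depends on what will read the output; a terminal that is not configured for UTF-8 may still display the first form incorrectly.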

Since version 4.0.1 there's a method called decode_contents():

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
<div id='theDiv'>
<p>div content</p>
<p>div stuff</p>
<p>div thing</p>
</div>
""", "html.parser")

>>> print(soup.div.decode_contents())

<p>div content</p>
<p>div stuff</p>
<p>div thing</p>

More details in a solution to this question: https://stackoverflow.com/a/18602241/237105
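As a rough comparison, the sketch below puts `decode_contents()` next to its bytes-returning counterpart `encode_contents()` and the manual join over `.contents`. It assumes Python 3 and the built-in `html.parser`; the sample markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, written without whitespace between tags
html = "<div id='theDiv'><p>div content</p><p>div stuff</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="theDiv")

# Inner HTML as a text (Unicode) string
as_text = div.decode_contents()

# Inner HTML as bytes, UTF-8 encoded by default
as_bytes = div.encode_contents()

# The manual approach: stringify each child and join
joined = "".join(str(child) for child in div.contents)

print(as_text)
print(as_bytes)
```

For simple markup like this the three results carry the same content; the method calls mainly save the explicit join and make the text-versus-bytes choice visible.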

Antony Hatchkins