22

I'm parsing HTML with BeautifulSoup. In the end, I would like to obtain the body contents, but without the body tags. However, BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed:

>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n  Some paragraph\n </p>'

This solution is a hack. There should be a better, more obvious way to do it.

Philipp Zedler
  • Despite all the answers, I still find the `.hidden = True` approach the cleanest one. Another hack, if a string result will suffice, would be to truncate the body tags: `str(soup.body)[6:-7]` or `soup.body.prettify()[6:-7]` – addmoss Oct 05 '20 at 07:03

2 Answers

38

Do you mean getting everything in between the body tags?

In this case you can use:

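from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from bs4 import BeautifulSoup
page = urlopen('some_site').read()  # 'some_site' is a placeholder for a real URL
soup = BeautifulSoup(page)  # optionally name a parser, e.g. BeautifulSoup(page, 'html.parser')
body = soup.find('body')
# Direct child tags of <body> only; recursive=False avoids returning nested tags again
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)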
Jeremy
Azwr
  • Thanks! When I have two paragraphs, should I use something like `''.join(['%s' % x for x in soup.body.findChildren()])`, or is there a better way? – Philipp Zedler Jan 30 '14 at 10:12
  • I had some issues using findChildren where some things appeared redundantly: because they were nested within multiple layers, they were added once for each containing layer. To get the contents of the body as they are in the original, without any redundancy or weirdness, I used `pagefilling = ''.join(['%s' % x for x in soup.body.contents])` – kpie Jul 27 '16 at 17:22
  • `body.findChildren(recursive=False)` helps you avoid getting nested elements twice. – alizx Sep 08 '18 at 00:16
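
The difference the comments describe can be seen in a small sketch. It uses `find_all`, the current name for `findChildren`, and the built-in `html.parser` so that no extra parser package is needed; the sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with a text node sitting directly inside <body>
html = "<html><body><p>One</p> loose text <p>Two <b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Direct child *tags* only: nested tags are not repeated, but bare text is skipped
direct_tags = soup.body.find_all(recursive=False)
tags_only = "".join(str(tag) for tag in direct_tags)

# .contents holds every direct child, including text nodes
everything = "".join(str(node) for node in soup.body.contents)

# Without recursive=False, nested tags such as <b> are returned as separate
# results too, which is where the duplicated output comes from when joining
all_descendants = soup.body.find_all()

print(tags_only)             # <p>One</p><p>Two <b>bold</b></p>
print(everything)            # <p>One</p> loose text <p>Two <b>bold</b></p>
print(len(all_descendants))  # 3
```

If the body can contain text outside of any tag, joining `.contents` is probably the safer choice, since `find_all(recursive=False)` silently drops that text.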
4

I've found the easiest way to get just the contents of the body is to unwrap() your contents from inside the body tags.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Hello World</p>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print(soup)
<html><head></head><body><p>Hello World</p></body></html>
>>>
>>> soup.html.unwrap()
<html></html>
>>>
>>> print(soup)
<head></head><body><p>Hello World</p></body>
>>>
>>> soup.head.unwrap()
<head></head>
>>>
>>> print(soup)
<body><p>Hello World</p></body>
>>>
>>> soup.body.unwrap()
<body></body>
>>>
>>> print(soup)
<p>Hello World</p>

To be more efficient and reusable you could put those undesirable elements in a list and loop through them...

>>> def get_body_contents(html):
...  soup = BeautifulSoup(html, "html5lib")
...  for name in ['head','html','body']:
...    tag = soup.find(name)  # None if absent; hasattr(soup, name) is always True
...    if tag is not None:
...      tag.unwrap()
...  return soup
>>>
>>> html = "<p>Hello World</p>"
>>> print(get_body_contents(html))
<p>Hello World</p>
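
If a string result is enough, a non-mutating alternative is to serialize the inside of the body with `Tag.decode_contents()`. The following is only a sketch: the helper name `body_inner_html` is hypothetical, and it assumes the built-in `html.parser`, which (unlike `html5lib`) does not add html, head, or body tags, so the function falls back to the whole document when no body tag is present.

```python
from bs4 import BeautifulSoup

def body_inner_html(html, parser="html.parser"):
    """Return the markup inside <body> as a string, without the body tags.

    Hypothetical helper: if the parsed document has no <body> (as happens
    with html.parser on a bare fragment), the whole document is serialized.
    """
    soup = BeautifulSoup(html, parser)
    root = soup.body if soup.body is not None else soup
    # decode_contents() renders the children of a tag, not the tag itself
    return root.decode_contents()

print(body_inner_html("<p>Hello World</p>"))
# <p>Hello World</p>
print(body_inner_html("<html><head><title>t</title></head><body><p>Hi</p></body></html>"))
# <p>Hi</p>
```

Unlike unwrapping `head`, this leaves anything inside the head (such as a title) out of the result, and it does not modify the parsed tree.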
Jeremy