
If I have the following chunk of HTML:

chunk = '<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p>'

When I do the following:

from bs4 import BeautifulSoup
soup = BeautifulSoup(chunk)

The chunk turns into this:

>>> soup
<html><body><p>BLA bla bla html... </p><div>Copyright 2014 NPR</div></body></html>

The paragraph tag gets closed early and the div is pulled outside of it.

I was surprised by this. Is this expected behavior for BeautifulSoup and if so, can anyone explain why it's doing this?

EDIT: Just to note, I realize that this HTML is invalid, but I didn't realize BeautifulSoup would edit invalid HTML to this degree. Here's a related SO question on invalid HTML (a div inside a p tag).

1 Answer


The HTML you presented is not well-formed. In this case, as stated in the documentation:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.

So, the behavior really depends on the underlying parser used by BeautifulSoup. Since you haven't specified one explicitly, BeautifulSoup chooses it according to this ranking:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
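As a side note, a quick way to check which of these parsers are actually installed in a given environment is to try each feature name and catch bs4's FeatureNotFound exception (a minimal sketch; the feature strings are the same ones used in the comparison below):

from bs4 import BeautifulSoup, FeatureNotFound

# try each parser feature name; bs4 raises FeatureNotFound if the parser is missing
for name in ('lxml', 'html5lib', 'html.parser'):
    try:
        BeautifulSoup('<p>test</p>', name)
        print(name, 'is available')
    except FeatureNotFound:
        print(name, 'is not installed')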

This is what the different parsers do with the HTML you provided:

>>> from bs4 import BeautifulSoup
>>> chunk = '<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p>'

# html.parser
>>> BeautifulSoup(chunk, 'html.parser')
<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p> 

# html5lib
>>> BeautifulSoup(chunk, 'html5lib')
<html><head></head><body><p>BLA bla bla html... </p><div>Copyright 2014 NPR</div><p></p></body></html>

# lxml
>>> BeautifulSoup(chunk, 'lxml')
<html><body><p>BLA bla bla html... </p><div>Copyright 2014 NPR</div></body></html>

# xml
>>> BeautifulSoup(chunk, 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p>

According to the output, you have lxml installed in this particular Python environment, and BeautifulSoup uses it as the underlying parser since you haven't specified one explicitly.
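If you want the result to be independent of whichever parser happens to be installed, pass the parser name explicitly. Here is a minimal sketch using the built-in html.parser, which, as shown in the comparison above, keeps the div nested inside the p; the .div.text lookup is just for illustration:

from bs4 import BeautifulSoup

chunk = '<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p>'
# explicitly request the lenient built-in parser instead of letting bs4 pick one
soup = BeautifulSoup(chunk, 'html.parser')
print(soup.div.text)  # -> Copyright 2014 NPR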

alecxe
  • Followup: this is a really nicely written response and I used the information in it today to solve a problem I've had for a few days. Thanks again! – erewok Aug 14 '14 at 16:37