The HTML you presented is not well-formed. In this case, as stated in the documentation:
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won’t matter.
One parser will be faster than another, but they’ll all give you a
data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will
give different results.
So, the behavior really depends on the underlying parser used by BeautifulSoup. And, since you haven't specified one explicitly, BeautifulSoup chooses it according to this ranking:
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
This is what the different parsers do with the HTML you provided:
>>> from bs4 import BeautifulSoup
>>> chunk = '<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p>'
# html.parser
>>> BeautifulSoup(chunk, 'html.parser')
<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p>
# html5lib
>>> BeautifulSoup(chunk, 'html5lib')
<html><head></head><body><p>BLA bla bla html... </p><div>Copyright 2014 NPR</div><p></p></body></html>
# lxml
>>> BeautifulSoup(chunk, 'lxml')
<html><body><p>BLA bla bla html... </p><div>Copyright 2014 NPR</div></body></html>
# xml
>>> BeautifulSoup(chunk, 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p>
According to the output, you have lxml installed in this particular Python environment, and BeautifulSoup uses it as the underlying parser since you haven't specified one explicitly.
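If you need reproducible results across environments, pin the parser explicitly rather than relying on the ranking. A minimal sketch (html.parser is in the standard library, so it is always available; it also happens to leave your particular snippet untouched, as shown above):

```python
from bs4 import BeautifulSoup

chunk = '<p>BLA bla bla html... <div>Copyright 2014 NPR</div></p>'

# Pin the parser explicitly so the result does not depend on
# which parsers happen to be installed in the environment.
soup = BeautifulSoup(chunk, 'html.parser')

print(str(soup))             # unchanged: html.parser does not restructure it
print(soup.div.get_text())   # Copyright 2014 NPR
```

With lxml or html5lib the same call would restructure the markup as shown in the transcript, so pick the parser whose fix-up behavior matches what you actually want.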