You cannot rely on BeautifulSoup, or any other HTML parser, to read arbitrary web pages, because you are never guaranteed that a web page is a well-formed document. Let me explain what is happening in this particular case.
On that page there is this inline JavaScript:
var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";
You can see that it's building a string that will put a script tag onto the page. If you're an HTML parser, this is a very tricky thing to deal with. You go along reading your tokens when suddenly you hit a <script> tag. Now, unfortunately, if you did this:
<script>
alert('hello');
<script>
alert('goodbye');
A lenient parser would say: OK, I found an open script tag. Oh, I found another open script tag! They must have forgotten to close the first one! And the parser would treat both as valid scripts.
So, in this case, BeautifulSoup sees a <script> tag, and even though it's inside a JavaScript string, it looks like it could be a valid start tag, so BeautifulSoup chokes on it, as well it should.
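To make that failure mode concrete, here is a minimal Python sketch of a deliberately naive tag scanner. It is a hypothetical illustration, not BeautifulSoup's actual implementation: `NAIVE_START_TAG` and `naive_start_tags` are made-up names, and the scanner simply treats every `<name` it meets as a start tag, with no idea that it might be inside a JavaScript string.

```python
import re

# Hypothetical naive scanner: anything that looks like "<name" counts as a
# start tag, even when it sits inside a JavaScript string literal.
NAIVE_START_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")


def naive_start_tags(html):
    """Return the lowercased names of everything that looks like a start tag."""
    return [match.group(1).lower() for match in NAIVE_START_TAG.finditer(html)]


# The inline script from the page, wrapped in the <script> element it lives in.
page = """<script type='text/javascript'>
var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";
</script>"""

# The scanner reports two "script" start tags, although only the first is real.
print(naive_start_tags(page))
```

The second entry comes from the string literal, which is exactly the kind of false start tag described above.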
If you look at the string again, you can see they do this interesting piece of work:
... "</" + "script>";
This seems odd, right? Wouldn't it be better to just write str = " ... </script>" without the extra string concatenation? This is actually a common trick (used by people who write script tags as strings, which is a bad practice) to make the parser NOT break. Because if you do this:
var a = '</script>';
in an inline script, the parser will come along, see just </script>, conclude that the whole script tag has ended, and dump the rest of that script's contents onto the page as plain text. This is because a closing script tag is technically allowed anywhere, even where it leaves your JS syntax invalid. From the parser's point of view, it's better to get out of the script tag early than to try to run your HTML code as JavaScript.
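You can watch this happen with a small Python sketch built on the standard-library `html.parser` module. This is only an illustration of the early-exit behaviour, not a claim about what BeautifulSoup did on that page; `ScriptSplitter` and `split_script` are hypothetical names. It collects the text the parser considers to be inside the script, and separately the text that leaks out of it.

```python
from html.parser import HTMLParser


class ScriptSplitter(HTMLParser):
    """Separates text seen inside <script> from text that leaks outside it."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_text = []
        self.leaked_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        target = self.script_text if self.in_script else self.leaked_text
        target.append(data)


def split_script(html):
    """Return (text inside the script, text that ended up outside it)."""
    parser = ScriptSplitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.script_text), "".join(parser.leaked_text)


# A literal closing tag inside the string ends the script early.
literal = "<script>var a = '</script>'; alert('hello');</script>"
# The concatenation trick keeps the whole script in one piece.
split = "<script>var a = '</' + 'script>'; alert('hello');</script>"

print(split_script(literal))
print(split_script(split))
```

With the literal version, the script text stops at `var a = '` and the rest, including the `alert` call, is reported as ordinary page text. With the split version, nothing leaks.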
So, you can't rely on a regular HTML parser to parse web pages. It's a very, very dangerous game, because there is no guarantee you'll get well-formed HTML. Depending on what you're trying to do, you could read the content of the page with a regex, or get the fully rendered page content from a headless browser.
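As a rough sketch of the regex route, assuming all you want is the URLs that appear in `src` attributes, something like the following might do. `SRC_ATTR` and `find_src_urls` are hypothetical names, and the pattern only handles quoted attribute values, so treat it as a starting point rather than a general solution.

```python
import re

# Assumption: the values we care about are quoted, as in src='...' or src="...".
SRC_ATTR = re.compile(r"""\bsrc\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)


def find_src_urls(raw_html):
    """Return every quoted src value found in the raw page text."""
    return [match.group(2) for match in SRC_ATTR.finditer(raw_html)]


# The inline JavaScript line from the page, treated as plain text.
raw = '''var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";'''

print(find_src_urls(raw))
```

Because the regex works on the raw text, it does not care whether the tag is real markup or part of a JavaScript string, which is the property you want here.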