I have discovered that BeautifulSoup 4 appears to escape some characters in inline JavaScript:

>>> print s
<DOCTYPE html>
<html>
<body>
<h1>Test page</h1>
<script type="text/javascript">
//<!--
if (4 > 3 && 3 < 4) {
        console.log("js working");
}
//-->
</script>
</body>
</html>
>>> import bs4
>>> soup = bs4.BeautifulSoup(s, 'html5lib')
>>> print soup
<html><head></head><body><doctype html="">


<h1>Test page</h1>
<script type="text/javascript">
//&lt;!--
if (4 &gt; 3 &amp;&amp; 3 &lt; 4) {
        console.log("js working");
}
//--&gt;
</script>

</doctype></body></html>
>>> print soup.prettify()
<html>
 <head>
 </head>
 <body>
  <doctype html="">
   <h1>
    Test page
   </h1>
   <script type="text/javascript">
    //&lt;!--
if (4 &gt; 3 &amp;&amp; 3 &lt; 4) {
        console.log("js working");
}
//--&gt;
   </script>
  </doctype>
 </body>
</html>

In case it's lost in the above, the key problem is that:

if (4 > 3 && 3 < 4)

gets converted into:

if (4 &gt; 3 &amp;&amp; 3 &lt; 4)

which doesn't work particularly well ...

I have tried the included formatters in the prettify() method, with no success.

So, any idea how to stop the JavaScript from being escaped? Or how to unescape it before outputting it?

Hamish Downer
  • Note that it should be `<!DOCTYPE html>`. – Martijn Pieters Apr 11 '13 at 10:12
  • The `<!--` and `-->` comments are actually useless in JavaScript, because JavaScript can make use of `<`, `&` and `>` characters just fine. You should really use `<![CDATA[` and `]]>` instead to 'escape' the content of a `<script>` tag. – Martijn Pieters Apr 11 '13 at 10:18
  • This is a problem with the parser; it doesn't see the content as a comment, so the `<!--` and `-->` prefix and suffix are escaped. – Martijn Pieters Apr 11 '13 at 10:59
  • @MartijnPieters - whichever version of the comment I use, or if I don't use the comment, the key problem is the `if` statement contents being escaped. I'll update the question to make that clearer. – Hamish Downer Apr 11 '13 at 11:05

1 Answer

Edit: This bug was fixed in 4.2.0, released on 30-May-2013.

>>> import bs4
>>> bs4.__version__
'4.2.0'
>>> s = """<DOCTYPE html>
... <html>
... <body>
... <h1>Test page</h1>
... <script type="text/javascript">
... //<!--
... if (4 > 3 && 3 < 4) {
...     console.log("js working");
... }
... //-->
... </script>
... </body>
... </html>
... """
>>> soup = bs4.BeautifulSoup(s)
>>> print soup
<html><body><doctype html="">
<h1>Test page</h1>
<script type="text/javascript">
//<!--
if (4 > 3 && 3 < 4) {
    console.log("js working");
}
//-->
</script>
</doctype></body></html>
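
To decide at runtime whether a workaround is needed at all, the installed version can be compared against the release that contains the fix. The following is only a rough sketch: `needs_script_workaround` and `FIXED_IN` are hypothetical names made up for this example, and the version string is assumed to look like the `bs4.__version__` value shown above.

```python
import re

# Release that the answer above reports as containing the fix.
FIXED_IN = (4, 2, 0)


def needs_script_workaround(version):
    """Return True if `version` (e.g. the value of bs4.__version__) predates the fix.

    Hypothetical helper: it keeps at most the first three numeric components
    of the version string and pads missing ones with zeros.
    """
    parts = tuple(int(n) for n in re.findall(r"\d+", version)[:3])
    parts = parts + (0,) * (3 - len(parts))
    return parts < FIXED_IN


# Typical use would be: needs_script_workaround(bs4.__version__)
print(needs_script_workaround("4.1.3"))
print(needs_script_workaround("4.2.0"))
```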

If you are stuck on a version earlier than 4.2 for some reason, I found this StackOverflow answer. It seems to me you can do something similar: walk the tree, calling prettify() on every tag except the script tag, which you somehow emit without escaping.
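
One possible variant, sketched here with the standard library only, skips the tree walk and instead post-processes the serialized output, undoing the entity escaping inside `<script>` bodies. This is not the method from the linked answer; `unescape_script_bodies` and `SCRIPT_RE` are names invented for this illustration, and the regular expression assumes simple, non-nested `<script>` elements.

```python
import html
import re

# Captures the opening <script ...> tag, the script body, and the closing tag.
SCRIPT_RE = re.compile(r"(<script\b[^>]*>)(.*?)(</script>)",
                       re.IGNORECASE | re.DOTALL)


def unescape_script_bodies(markup):
    """Undo entity escaping inside <script> bodies of already-serialized HTML."""
    def fix(match):
        open_tag, body, close_tag = match.groups()
        # Only the body is unescaped; the tags themselves are left untouched.
        return open_tag + html.unescape(body) + close_tag
    return SCRIPT_RE.sub(fix, markup)


# The escaped script element as printed by the affected version in the question.
escaped = (
    '<script type="text/javascript">\n'
    "//&lt;!--\n"
    "if (4 &gt; 3 &amp;&amp; 3 &lt; 4) {\n"
    '        console.log("js working");\n'
    "}\n"
    "//--&gt;\n"
    "</script>"
)
print(unescape_script_bodies(escaped))
```

Apply this only to output from an affected version: on a fixed version the script body is not escaped in the first place, so unescaping it would corrupt any entity the script contains on purpose.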

stvsmth