
I'm writing a blog app with Django. I want to enable comment writers to use some tags (like <strong>, a, et cetera) but disable all others.

In addition, I want to let them put code in <code> tags, and have pygments parse them.

For example, someone might write this comment:

I like this article, but the third code example <em>could have been simpler</em>:

<code lang="c">
#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code>

Problem is, when I parse the comment with BeautifulSoup to strip disallowed HTML tags, it also parses the insides of the <code> blocks, and treats <stdbool.h> and <stdio.h> as if they were HTML tags.
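
For instance, a quick sanity check (BeautifulSoup 3; the variable names here are just for illustration) shows the header being picked up as a tag in the parse tree:

from BeautifulSoup import BeautifulSoup

snippet = '<code lang="c">#include <stdio.h></code>'
soup = BeautifulSoup(snippet)
print soup.findAll('stdio.h')  # non-empty: the header was parsed as a tag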

How could I tell BeautifulSoup not to parse the <code> blocks? Maybe there are other HTML parsers better for this job?

Dor

5 Answers


The problem is that <code> is treated according to the normal rules for HTML markup, and content inside <code> tags is still HTML (the tag exists mainly to drive CSS formatting, not to change the parsing rules).

What you are trying to do is create a different markup language that is very similar, but not identical, to HTML. The simple solution would be to assume certain rules, such as "<code> and </code> must appear on a line by themselves," and do some pre-processing yourself.

  1. A very simple — though not 100% reliable — technique is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It isn't completely reliable, because if the code block contains ]]>, things will go horribly wrong.
  2. A safer option is to replace dangerous characters inside code blocks (<, > and & probably suffice) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each block of code you identify to cgi.escape(code_block).

Once you've completed preprocessing, submit the result to BeautifulSoup as usual.
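
For instance, a minimal sketch of option 2 (the regex, the helper name, and the assumption that <code> blocks are never nested are all mine):

import cgi
import re
from BeautifulSoup import BeautifulSoup

# Matches <code ...> ... </code>, keeping the opening tag (and any lang
# attribute) intact. Assumes code blocks are never nested.
CODE_BLOCK = re.compile(r'(<code[^>]*>)(.*?)(</code>)', re.DOTALL)

def escape_code_blocks(comment):
    def escape(match):
        # Replace <, > and & inside the block with character entity
        # references so the parser sees plain text.
        return match.group(1) + cgi.escape(match.group(2)) + match.group(3)
    return CODE_BLOCK.sub(escape, comment)

soup = BeautifulSoup(escape_code_blocks(comment_text))  # comment_text: the raw comment string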

Marcelo Cantos
  • Option #2 seems like a winner. How would I go about that? Regular expressions, or some sophisticated string processing algorithm? – Dor Oct 26 '10 at 19:07
  • @Dor: I've amended my answer to cover this. – Marcelo Cantos Oct 26 '10 at 20:49
  • I've tried this, but obviously cgi.escape expects a string, not a BeautifulSoup tag object :) How can I escape the contents of the tag prior to the parsing? – Dor Oct 26 '10 at 22:40
  • You should extract the text between the `<code>` and `</code>` lines as per my original answer, pass it through `cgi.escape` and concatenate it all back together. Then (and only then) pass the whole thing to BeautifulSoup. – Marcelo Cantos Oct 26 '10 at 22:58
  • Marcelo Cantos: [_That's the main part of the question - *how?* – @Dor Oct 24 '10 at 15:47_](http://stackoverflow.com/questions/4007434/parsing-a-document-with-beautifulsoup-while-not-parsing-the-contents-of-code-ta/4007459#4007459) – jfs Oct 06 '11 at 23:50

From the Python wiki:

>>> import cgi
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup('&lt;string.h&gt;',
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
N 1.1

Unfortunately, BeautifulSoup cannot be told to skip parsing the code blocks.

One way to achieve what you want is to:

1) Remove the code blocks

soup = BeautifulSoup(unicode(content))
code_blocks = soup.findAll(u'code')
for block in code_blocks:
    block.replaceWith(u'<code class="removed"></code>')

2) Do the usual parsing to strip the disallowed tags.

3) Re-insert the code blocks and regenerate the HTML.

stripped_code = stripped_soup.findAll(u"code", u"removed")
# re-insert pygment formatted code

I would have answered with some code, but I recently read a blog that does this elegantly.
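
Roughly, the three steps might look like this (the tag whitelist, the helper names and the pygments calls are my own choices here, not a reference implementation):

from BeautifulSoup import BeautifulSoup, Tag
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter

ALLOWED_TAGS = [u'strong', u'em', u'a', u'p', u'code']  # your whitelist here

def sanitize(content):
    soup = BeautifulSoup(unicode(content))

    # 1) Remember each code block's language and source, and swap in an
    #    empty <code class="removed"> placeholder so its position survives.
    saved = []
    for block in soup.findAll(u'code'):
        placeholder = Tag(soup, u'code')
        placeholder[u'class'] = u'removed'
        saved.append((placeholder, block.get(u'lang', u'text'),
                      block.renderContents()))
        block.replaceWith(placeholder)

    # 2) Do the usual stripping of disallowed tags (dropped wholesale here
    #    for brevity; you may prefer to keep their text contents).
    for tag in soup.findAll(True):
        if tag.name not in ALLOWED_TAGS:
            tag.extract()

    # 3) Re-insert pygments-formatted code in place of each placeholder.
    for placeholder, lang, source in saved:
        highlighted = highlight(source, get_lexer_by_name(lang), HtmlFormatter())
        placeholder.replaceWith(BeautifulSoup(highlighted))

    return unicode(soup)

Note that, as Dor points out in the comment below, BeautifulSoup will already have inserted closing tags for things like <stdio.h> by the time step 1 runs, so you still need to escape the code before the initial parse (as in the other answers) if you want the original source back intact.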

pyfunc
  • When I first parse the string, BeautifulSoup inserts the closing </stdbool.h> and </stdio.h> tags. So even if I used this technique I'd still get these closing tags in my code blocks. – Dor Oct 24 '10 at 15:45

EDIT:

Use python-markdown2 to process the input, and have users indent the code areas.

>>> print html
I like this article, but the third code example <em>could have been simpler</em>:

    #include <stdbool.h>
    #include <stdio.h>

    int main()
    {
        printf("Hello World\n");
    }

>>> import markdown2
>>> marked = markdown2.markdown(html)
>>> marked
u'<p>I like this article, but the third code example <em>could have been simpler</em>:</p>\n\n<pre><code>#include &lt;stdbool.h&gt;\n#include &lt;stdio.h&gt;\n\nint main()\n{\n    printf("Hello World\\n");\n}\n</code></pre>\n'
>>> print marked
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>

<pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

int main()
{
    printf("Hello World\n");
}
</code></pre>

If you still need to navigate and edit it with BeautifulSoup, do the stuff below. Include the entity conversion if you need the '<' and '>' to be reinserted (instead of '&lt;' and '&gt;').

>>> soup = BeautifulSoup(marked,
...                      convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> soup
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>
<pre><code>#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code></pre>


import cgi

def thickened(soup):
    """
    <code>
    blah blah <entity> blah
        blah
    </code>
    """
    codez = soup.findAll('code') # get the code tags
    for code in codez:
        # take all the contents inside of the code tags and convert
        # them into a single string
        escape_me = ''.join([k.__str__() for k in code.contents])
        escaped = cgi.escape(escape_me) # escape them with cgi
        code.replaceWith('<code>%s</code>' % escaped) # replace Tag objects with escaped string
    return soup
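
Putting it together might look something like this (safe_html is just my name for the final result; the rest reuses the pieces above):

import markdown2
from BeautifulSoup import BeautifulSoup

marked = markdown2.markdown(html)     # html: the raw comment text, as above
soup = BeautifulSoup(marked,
                     convertEntities=BeautifulSoup.HTML_ENTITIES)
safe_html = unicode(thickened(soup))  # <code> contents escaped again
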
BenjaminGolder

If a <code> element contains unescaped <, > and & characters inside the code, then it is not valid HTML. BeautifulSoup will try to convert it to valid HTML, which is probably not what you want.

To convert the text to valid HTML, you could adapt a regex that strips tags from HTML: use it to find the text of each <code> block and replace that text with its cgi.escape()'d version. It should work fine as long as there are no nested <code> tags. After that, you can feed the sanitized HTML to BeautifulSoup.

jfs