How to remove program code from the posts of the stackoverflow data using python

Question

I am working on a Stack Overflow data project and I have created a database dump file from the XML files provided by the Stack Exchange data dump for Stack Overflow. Now I want to create a pickled file for the posts, but I want to remove the programming code present in the posts. I just want the text of each post. How can I do it?

My current code is:

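import sqlite3
import cPickle as pickle  # Python 2; on Python 3 use "import pickle"

import lxml.html

result = list()

def dumpDB():
    conn = sqlite3.connect('Path of data dump file')

    cur = conn.cursor()
    cur.execute('SELECT Title, Body, OwnerUserID, Tags, Id FROM posts where Tags is not null')

    for item in cur:
        record = list(item)
        # record[1] is the Body column, which holds the post as an HTML fragment
        doc = lxml.html.document_fromstring(record[1])
        record[1] = str(doc.text_content()) # this will strip the html tags
        result.append(record)  # collect the cleaned record so it can be pickled later

    conn.close()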

NOTE: If there is an error in my explanation or code, I would appreciate help in pinpointing it so that I can correct it. I am new at this, so there could be some mistakes.

EDIT: I saw the post that some suggested as having the answer to this problem, but I cannot see how to apply the solution suggested there to my present issue. As suggested by the first comment below, I have to look for the contents inside <pre><code> tags and remove everything present inside them. How can I use remove() to do that? I also looked at doing it through BeautifulSoup, but couldn't find a way to use it for my case.
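
One possible way to do this with lxml is sketched below. It is a minimal sketch, not a tested answer for the full data dump: strip_code and its drop_inline flag are hypothetical names introduced here for illustration, and it assumes each Body value is an HTML fragment in which block code sits inside <pre> elements. It uses drop_tree() from lxml.html rather than remove(), because remove() would also discard the text that follows the removed element.

```python
import lxml.html
from lxml.etree import ParserError


def strip_code(body_html, drop_inline=False):
    """Return the plain text of a post body with code removed.

    Hypothetical helper: the name and the drop_inline flag are
    illustrative and are not part of lxml.
    """
    try:
        doc = lxml.html.document_fromstring(body_html)
    except ParserError:
        # lxml raises ParserError when the document is empty.
        return ''
    # <pre> holds block code; <code> outside <pre> is inline code.
    query = '//pre | //code' if drop_inline else '//pre'
    for element in doc.xpath(query):
        # drop_tree() removes the element and its children but keeps the
        # tail text after it; a plain remove() would discard the tail too.
        element.drop_tree()
    return doc.text_content()
```

In the loop above, this would replace the two lxml lines with record[1] = strip_code(record[1]). Passing drop_inline=True also removes inline <code> snippets, which may or may not be wanted depending on how much of the sentence structure should survive.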

If you want to strip out formatted code (like the code block in your question), I'd suggest looking for `<pre>` and `<code>` tags before you strip the HTML from the post body. Alas, I don't know enough about `xml` processing to tell you how to do that. — Blckknght, May 04 '14 at 09:19
@Blckknght is the formatted code present between `<pre>` and `<code>` tags? This will affect the xml processing, as most of the lxml built-in functions assume that a tag is opened and closed consistently: if it starts with `pre`, then it should also end with `pre`. I think my understanding is not wrong — user2966197, May 04 '14 at 21:09
It's not between the different tags, but rather within the opening and closing tags of one of those types (or sometimes both, nested, e.g. `<pre><code>stuff</code></pre>`). Perhaps my "and" should have been "and/or". I described both because I'm not sure if you want to filter all code (including inline things like `foo`) or just the big pre-formatted blocks. — Blckknght, May 04 '14 at 22:18
@Blckknght I got what you meant. But that is exactly the problem: how do I remove the contents of `<pre><code>stuff</code></pre>`? I looked at a few examples, but none had a combination of tags. — user2966197, May 04 '14 at 22:31
Well, if you remove the contents of the `<pre>` tags, the inner `<code>` tags will go too. — Blckknght, May 04 '14 at 23:38
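
The point made in the comments, that skipping everything inside the outer tag also takes care of the tags nested in it, can be illustrated without any third-party library. The following is only a sketch using Python 3's standard html.parser module, which is a swapped-in technique and not the lxml approach from the question. ProseExtractor and prose_only are hypothetical names, and the sketch assumes the post bodies are reasonably well-formed HTML fragments.

```python
from html.parser import HTMLParser


class ProseExtractor(HTMLParser):
    """Collect text that is not inside any of the skipped tags."""

    def __init__(self, skip_tags=('pre',)):
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)
        self.depth = 0    # number of skipped tags currently open
        self.parts = []   # text chunks found outside skipped tags

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # While any skipped tag is open, everything nested in it is dropped.
        if self.depth == 0:
            self.parts.append(data)


def prose_only(body_html, skip_tags=('pre',)):
    """Return the text of body_html, omitting content of skip_tags."""
    parser = ProseExtractor(skip_tags)
    parser.feed(body_html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)
```

With the default skip_tags=('pre',), only the pre-formatted blocks are dropped and inline snippets such as `foo` are kept. Passing skip_tags=('pre', 'code') filters inline code as well, which corresponds to the "and/or" distinction raised above.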
