I am working on a Stack Overflow data project and have created a database dump file from the XML files in the Stack Exchange data dump for Stack Overflow. Now I want to create a pickled file of the posts, but with the programming code in each post removed. I just want the text of the post. How can I do it?

My current code status is:

import sqlite3
import cPickle as pickle

import lxml.html

result = list()

def dumpDB():
    conn = sqlite3.connect('Path of data dump file')

    cur = conn.cursor()
    # only posts with tags (questions) are selected
    cur.execute('SELECT Title, Body, OwnerUserID, Tags, Id FROM posts where Tags is not null')

    for item in cur:
        record = list(item)
        # record[1] is the Body column, which holds the post as HTML
        doc = lxml.html.document_fromstring(record[1])
        record[1] = str(doc.text_content()) # this will strip the html tags

NOTE: If there are errors in my explanation or code, I would appreciate help pinpointing them so that I can correct them. I am new to this, so there may be some mistakes.

EDIT: I saw the post that some suggested as answering this problem, but I can't see how to apply the solution from that post to my issue. As suggested by the first comment below, I have to look for content inside <pre><code> tags and remove everything inside them. How can I use remove() to do that? I also looked at doing it with BeautifulSoup but couldn't find a way to use it for my case.

user2966197
  • If you want to strip out formatted code (like the code block in your question), I'd suggest looking for `<pre>` and `<code>` tags before you strip the HTML from the post body. Alas, I don't know enough about `xml` processing to tell you how to do that.
    – Blckknght May 04 '14 at 09:19
  • @Blckknght Is the formatted code present between `<pre>` and `<code>` tags? This will affect the XML processing, as most of lxml's built-in functions assume a consistent opening and closing tag: if an element starts with `pre`, it should also end with `pre`. I think my understanding is not wrong. – user2966197 May 04 '14 at 21:09
  • It's not between the different tags, but rather, within opening and closing tags of one of those types (or sometimes both, nested, e.g. `<pre><code>stuff</code></pre>`). Perhaps my "and" should have been "and/or". I described both because I'm not sure if you want to filter all code (including inline things like `foo`) or just the big pre-formatted blocks.
    – Blckknght May 04 '14 at 22:18
  • @Blckknght I see what you mean. But the problem is how to remove the contents of `<pre><code>stuff</code></pre>`. I looked at a few examples, but none had a combination of tags.
    – user2966197 May 04 '14 at 22:31
  • Well, if you remove the contents of the `<pre>` tags, the inner `<code>` tags will go too.
    – Blckknght May 04 '14 at 23:38
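
The idea from the comments (drop each `<pre>` element, and its nested `<code>` goes with it) might look roughly like the sketch below. It is only an illustration written for Python 3, not a tested solution for the full data dump. The helper name `strip_code_blocks` and its `drop_inline` flag are made up for this example. It relies on `drop_tree()` from `lxml.html`, which removes an element together with its children while keeping the text that follows it.

```python
import lxml.html


def strip_code_blocks(body_html, drop_inline=False):
    """Return the text of a post body with code removed.

    Hypothetical helper: <pre> blocks are always dropped; inline <code>
    is dropped only when drop_inline is True.
    """
    if not body_html or not body_html.strip():
        return ""

    # Wrap the fragment in a <div> so there is always a single root element.
    root = lxml.html.fragment_fromstring(body_html, create_parent="div")

    tags = ("pre", "code") if drop_inline else ("pre",)
    for tag in tags:
        # Build the list first, because the tree is modified while looping.
        for node in list(root.iter(tag)):
            # drop_tree() removes the element and everything inside it,
            # but keeps the tail text that comes after the closing tag.
            node.drop_tree()

    return root.text_content()


if __name__ == "__main__":
    sample = (
        "<p>Use a loop:</p>"
        "<pre><code>for i in x: print(i)</code></pre>"
        "<p>Call <code>foo()</code> first.</p>"
    )
    print(strip_code_blocks(sample))
    print(strip_code_blocks(sample, drop_inline=True))
```

In the loop from the question, this would presumably replace the `document_fromstring` / `text_content()` pair, e.g. `record[1] = strip_code_blocks(record[1])`. Whether inline `<code>` should also be removed depends on what counts as "code" for the project, which is why it is left as an option here.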

0 Answers