I am working on a Stack Overflow data project and have created a database dump file from the XML files provided in the Stack Exchange data dump for Stack Overflow. Now I want to create a pickled file of the posts, but I want to remove the programming code present in the posts. I just want the text of each post. How can I do it?
My current code is:
import cPickle as pickle
import sqlite3

import lxml.html

result = list()

def dumpDB():
    conn = sqlite3.connect('Path of data dump file')
    cur = conn.cursor()
    cur.execute('SELECT Title, Body, OwnerUserID, Tags, Id FROM posts where Tags is not null')
    for item in cur:
        record = list(item)
        doc = lxml.html.document_fromstring(record[1])
        record[1] = str(doc.text_content())  # this will strip the html tags
        result.append(record)  # keep the cleaned record for pickling later
    conn.close()
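For the pickling step mentioned above, a minimal sketch might look like the following. It assumes Python 3's `pickle` module (the code above imports Python 2's `cPickle`, whose `dump` and `load` are called the same way). The helper names `save_records` and `load_records` and the file path are hypothetical, chosen only for illustration.

```python
import pickle


def save_records(records, path):
    """Write a list of records to a pickle file (hypothetical helper)."""
    # Pickle files must be opened in binary mode.
    with open(path, 'wb') as fh:
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_records(path):
    """Read the records back from a pickle file (hypothetical helper)."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)
```

Calling `save_records(result, 'posts.pkl')` after the loop has finished would then store every cleaned record in one file.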
NOTE: If there are errors in my explanation or code, I would appreciate help pinpointing them so that I can correct them. I am new to this, so there could be some mistakes.
EDIT: I saw the post that some suggested as having the answer to this problem, but I can't work out how to apply the solution suggested there to my present issue. As suggested by the first comment below, I have to look for the contents of `<pre><code>` tags and remove everything inside them. How can I use `remove()` to do that? I also looked at another way of doing it through `BeautifulSoup`, but couldn't find a way to use it for my case.
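One way to drop the code blocks with lxml is sketched below. It assumes that block-level code in the post bodies is always wrapped in `<pre>` elements, as the comment suggests; the function name `strip_code` is hypothetical. It uses `drop_tree()` from `lxml.html` rather than `remove()`, because `remove()` on the parent also discards any text that follows the removed element (its tail), while `drop_tree()` keeps that text.

```python
import lxml.html


def strip_code(body_html):
    """Return the text of a post body with <pre> blocks removed.

    Hypothetical helper; assumes code blocks live inside <pre> elements.
    """
    # document_fromstring always wraps the input in <html><body>, so every
    # <pre> element has a parent and can safely be dropped.
    # Note: it raises a ParserError if body_html is empty.
    doc = lxml.html.document_fromstring(body_html)

    # xpath() returns a plain list, so the tree can be modified while looping.
    for pre in doc.xpath('//pre'):
        pre.drop_tree()  # removes the element and its children, keeps tail text

    # text_content() strips the remaining HTML tags.
    return str(doc.text_content())
```

Inline snippets that sit in a bare `<code>` element inside a sentence are left in place by this sketch; changing the XPath to `'//pre | //code'` would remove those as well, at the cost of leaving gaps in the surrounding prose.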
` tags? This will affect the XML processing, as most of lxml's built-in functions assume consistent opening and closing tags: if an element starts with `pre`, then it should also end with `pre`. I don't think my understanding is wrong.