I backed up my blog in Google's XML format. It's quite long. So far, I have done this:
>>> import feedparser
>>> blogxml = feedparser.parse('blog.xml')
>>> type(blogxml)
<class 'feedparser.FeedParserDict'>
In the book I'm reading, the author does this:
>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.html_clean(content))
That works for me on an entry-by-entry basis. As you can see, I already have a way of cleaning HTML with NLTK. What I really want is to grab all the entries, clean them of HTML (which I already know how to do and am not asking about, so please read the question carefully), and write them to a file as a single plaintext string. So this is really a question about using feedparser correctly. Is there a simple way to do that?
Update:
I'm still no closer, as it turns out, to finding an easy way to do it. Due to my ineptitude with Python, I was forced to do something a bit ugly.
This is what I thought I'd do:
import feedparser
import nltk
blog = feedparser.parse('myblog.xml')
with open('myblog','w') as outfile:
    for itemnumber in range(0, len(blog.entries)):
        conts = blog.entries[itemnumber].content
        cleanconts = nltk.word_tokenize(nltk.html_clean(conts))
        outfile.write(cleanconts)
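Looking at that attempt again, a few things appear to stop it from running: `content` is a list (the book example indexes it with `content[0].value`), `nltk.word_tokenize` returns a list of tokens, and `write` only accepts a string. The sketch below isolates those two type issues with stand-in data, so it needs neither feedparser nor NLTK. It is written for Python 3, and `first_body` and `tokens_to_line` are hypothetical helper names, not part of either library.

```python
import io


def first_body(content):
    """Return the HTML string from a feedparser-style content list.

    Assumes each item is dict-like with a 'value' key, as in the
    book's `post.content[0].value` example.
    """
    return content[0]["value"]


def tokens_to_line(tokens):
    """Join a list of tokens into one writable line of text."""
    return " ".join(tokens) + "\n"


# Stand-in data: a content list and the kind of list a tokenizer returns.
content = [{"value": "<p>Today I was chatting</p>"}]
tokens = ["Today", "I", "was", "chatting"]

outfile = io.StringIO()  # behaves like a text file opened for writing
try:
    outfile.write(tokens)  # writing a list is rejected
except TypeError as err:
    print("write() refused the list:", err)

outfile.write(tokens_to_line(tokens))  # writing the joined string works
print(first_body(content))
```

The same two adjustments (index into `content`, join the tokens before writing) should carry over to the loop above, though I have not verified that against a real blog export.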
So, thank you very much, @Rob Cowie, but your version (which looks great) did not work. I feel bad for not pointing that out earlier, and for accepting the answer, but I don't have much time to work on this project. The stuff I put below is all I could get to work, but I'm leaving this question open in case someone has something more elegant.
import feedparser
import sys
blog = feedparser.parse('myblog.xml')
sys.stdout = open('blog','w')
for itemnumber in range(0, len(blog.entries)):
    print blog.entries[itemnumber].content
sys.stdout.close()
Then I Ctrl-D'ed out of the interpreter, because I had no idea how to close the open file without closing Python's stdout. I then re-entered the interpreter, opened the file, read it, and cleaned the HTML from there. (nltk.html_clean is a typo in the online version of the NLTK book itself, by the way; it's actually nltk.clean_html.) What I ended up with was almost, but not quite, plaintext.
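For comparison, here is a minimal sketch of the whole loop that writes straight to a file inside a `with` block, so `sys.stdout` never has to be redirected or closed. It is Python 3 rather than the Python 2 used above, and it swaps `nltk.clean_html` for a small tag stripper built on the standard library's `html.parser`, which is cruder than a real HTML-to-text converter. `strip_html`, `write_entries`, and `dump_feed` are hypothetical names, and `dump_feed` assumes every entry has a `content` list like the one in the book example.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so adjacent blocks do not fuse.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "blockquote"}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


def write_entries(html_fragments, outfile):
    """Write one cleaned entry per line to an already-open text file."""
    for html in html_fragments:
        outfile.write(strip_html(html) + "\n")


def dump_feed(feed_path, out_path):
    """Parse a feed file and write every entry body as plain text."""
    import feedparser  # third-party; imported here so the helpers work without it

    blog = feedparser.parse(feed_path)
    # Assumption: each entry exposes content[0].value, as in the book example.
    bodies = (entry.content[0].value for entry in blog.entries)
    with open(out_path, "w", encoding="utf-8") as outfile:
        write_entries(bodies, outfile)
```

With feedparser installed, something like `dump_feed('myblog.xml', 'myblog.txt')` should produce one line of text per post. The file is closed when the `with` block exits, so there is no need to leave the interpreter.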