
I backed up my blog in Google's XML format. It's quite long. So far, I have done this:

>>> import feedparser
>>> blogxml = feedparser.parse('blog.xml')
>>> type(blogxml)
<class 'feedparser.FeedParserDict'>

In the book I'm reading, the author does this:

>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.html_clean(content))

And that works for me on an entry-by-entry basis. As you can see, I've already got a way of cleaning HTML using the NLTK. But what I really want is to grab all the entries, clean the HTML out of each one (which I already know how to do, so that's not what I'm asking), and write them all to a file as plaintext. That has more to do with using feedparser correctly than with HTML cleaning. Is there a simple way to do it?

Update:

I'm still no closer, as it turns out, to finding an easy way to do it. Due to my ineptitude with Python, I was forced to do something a bit ugly.

This is what I thought I'd do:

import feedparser
import nltk

blog = feedparser.parse('myblog.xml')

with open('myblog','w') as outfile:
    for entry in blog.entries:
        # .content is a list of content dicts; take the first one's value
        conts = entry.content[0].value
        tokens = nltk.word_tokenize(nltk.clean_html(conts))
        # write() needs a string, not a list of tokens
        outfile.write(' '.join(tokens) + '\n')

So, thank you very much, @Rob Cowie, but your version (which looks great) did not work for me. I feel bad for not pointing that out earlier, and for having accepted the answer, but I don't have much time to spend on this project. The code below is all I could get to work, but I'm leaving the question open in case someone has something more elegant.

import feedparser
import sys

blog = feedparser.parse('myblog.xml')
sys.stdout = open('blog','w')

for itemnumber in range(0, len(blog.entries)):
    print blog.entries[itemnumber].content

sys.stdout.close()

then I CTRL-D'ed out of the interpreter, because I had no idea how to close the open file without closing Python's stdout. Then I re-entered the interpreter, opened the file, read the file, and cleaned the HTML from there. (nltk.html_clean is a typo in the online version of the NLTK book itself, by the way... it's actually nltk.clean_html). What I ended up with was almost, but not quite, plaintext.
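(An aside on the stdout workaround: a file opened with `open()` can be written to directly, so `sys.stdout` never needs to be reassigned or closed. Here is a minimal sketch of that pattern; the `TagStripper` class is a stdlib stand-in for the HTML-cleaning step, since `nltk.clean_html` has since been dropped from newer NLTK releases, and `raw_entries` is hypothetical sample data standing in for the strings feedparser would give you.)

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text found between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup):
    stripper = TagStripper()
    stripper.feed(markup)
    return ''.join(stripper.chunks)

# Hypothetical sample data; in the real script these strings would
# come from feedparser, e.g.:
#   blog = feedparser.parse('myblog.xml')
#   raw_entries = [e.content[0].value for e in blog.entries]
raw_entries = ['<p>First post.</p>', '<p>Second <em>post</em>.</p>']

# Writing to the file object directly leaves sys.stdout untouched,
# and the with-block closes the file for you.
with open('blogtext.txt', 'w') as outfile:
    for raw in raw_entries:
        outfile.write(strip_html(raw) + '\n')
```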

magnetar
  • possible duplicate of [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) –  Jun 29 '11 at 19:01
  • @Sentinel it's not a duplicate... my question has more to do with feedparser. i know how to clean HTML, and I've already shown that I can do that. I just don't know how to do it on every entry with feedparser. – magnetar Jul 03 '11 at 08:48

1 Answer

import feedparser
import nltk

llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")

with open('myblog.txt', 'w') as outfile:
    for entry in llog.entries:
        ## Do your processing here
        content = entry.content[0].value
        tokens = nltk.word_tokenize(nltk.clean_html(content))
        # join the token list back into a string before writing
        outfile.write(' '.join(tokens) + '\n')

Fundamentally, you need to open a file, iterate over the entries (`feed.entries`), process each entry as required, and write its appropriate representation to the file.

I make no assumption about how you want to delimit the post content in the text file. This snippet also doesn't write the post title, or any metadata to the file.
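For example, one possible convention would be to write the title, the cleaned body, then a separator line. (Sketch only: the `entries` list is hypothetical data mimicking the shape feedparser returns, and the bare regex is a stand-in for whatever HTML cleaner you prefer.)

```python
import re

# Hypothetical entries mimicking feedparser's structure; in practice
# you would iterate over llog.entries directly.
entries = [
    {'title': "He's My BF",
     'content': [{'value': '<p>Today I was chatting...</p>'}]},
]

def clean(html):
    # Bare-bones tag stripper; substitute your real HTML cleaner here.
    return re.sub(r'<[^>]+>', '', html)

with open('myblog_titled.txt', 'w') as outfile:
    for entry in entries:
        outfile.write(entry['title'] + '\n')
        outfile.write(clean(entry['content'][0]['value']) + '\n')
        outfile.write('=' * 40 + '\n')  # one possible post delimiter
```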

Rob Cowie
  • I believe you have to iterate over the entries by doing something like in this blog post: http://frizzletech.blogspot.com/2011/02/how-i-created-my-weekly-feed-digest.html ... you can't just write post.content[0]... – magnetar Aug 02 '11 at 22:00
  • @magnetar; You spotted an error in my example. I _am_ looping over the entries, but referencing `post` would raise a NameError. Copy/paste error I think. – Rob Cowie Aug 03 '11 at 13:02