(Neophyte post, apologies and thanks up front!)

My goal is to build a small app that monitors and parses a set of blogs' posts for outbound links, so I can then:

  1. Display top linked-to articles among the blogs in one frame; and,
  2. For a given linked-to article, display the posts (in my blogosphere) that link to it.
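
Both views could come from a single reverse index that maps each linked-to article to the set of posts linking to it. Below is a minimal sketch of that idea; the names `build_link_index` and `top_linked` and the `(post_url, outbound_urls)` input shape are hypothetical, and it assumes the links have already been extracted and de-duplicated per article.

```python
from collections import Counter, defaultdict


def build_link_index(posts):
    """Map each linked-to URL to the set of post URLs that link to it.

    `posts` is an iterable of (post_url, [outbound_url, ...]) pairs
    (a hypothetical shape chosen for this sketch).
    """
    linked_from = defaultdict(set)
    for post_url, outbound_urls in posts:
        for target in outbound_urls:
            # A set means a post linking twice to the same article counts once.
            linked_from[target].add(post_url)
    return linked_from


def top_linked(linked_from, n=10):
    """Return the n most linked-to URLs as (url, number_of_posts) pairs."""
    counts = Counter({target: len(sources) for target, sources in linked_from.items()})
    return counts.most_common(n)
```

`top_linked(index)` would feed the first view, and `index[article_url]` the second.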

So far my idea is to use:
- Python (with Django or some-such front end)
- Feedparser to read feeds and extract links from posts
- urlparse

The Big Question: am I missing anything obvious that would make this way easier?

Smaller question (that I can't figure out yet):
- Since outbound link URLs may differ even when pointing to the same article (NYT URLs and tinyURLs, for example), how can I check whether a URL is already in my list of linked items, beyond just comparing the absolute URL?
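
One partial approach is to normalize each URL before comparing it, so that superficial differences collapse to the same key. The sketch below uses Python 3's `urllib.parse` (the successor to the `urlparse` module). The function name `normalize_url` and the `TRACKING_PARAMS` list are assumptions made for illustration, and news sites may need extra site-specific rules. It does not handle shortened URLs, which require following the redirect.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters never change which article is shown.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of `url` suitable for equality comparison."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest so parameter order is irrelevant.
    query = urlencode(sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ))
    # The fragment (e.g. "#comments") is discarded.
    return urlunsplit((scheme, host, path, query, ""))
```

Two URLs would then be treated as the same article when `normalize_url(a) == normalize_url(b)`.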

This SO post was helpful at a high level, but parsing 'blogroll'-style link lists seems a lot easier than actively comparing URLs within a post, particularly to news sites that may do all sorts of funny things in their URLs.

Dave Guarino
  • Considering that the forwarding happens on the server side, I don't see any simpler way than following the links and checking where they really point (basically open the URL and call `geturl()` on the response object). – Voo Sep 12 '11 at 00:31
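
A minimal sketch of the approach in this comment, assuming Python 3's `urllib.request` (in 2011 this would have been `urllib2`). The function name `resolve_url` and the injectable `opener` parameter are hypothetical. The parameter is there only so the function can be exercised without network access.

```python
from urllib.request import urlopen


def resolve_url(url, opener=urlopen, timeout=10):
    """Return the final URL after following redirects.

    Falls back to the original URL if the request fails, so a dead
    short link does not abort the whole crawl.
    """
    try:
        response = opener(url, timeout=timeout)
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs such as a missing scheme.
        return url
    try:
        return response.geturl()
    finally:
        response.close()
```

Since every call is a network round trip, caching the results per short URL is probably worthwhile.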

1 Answer

I would go for the same setup. You'll probably also need lxml to parse the post content HTML and extract the `<a>` tags.
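
As a rough illustration of the extraction step, here is a sketch that uses the standard library's `html.parser` in place of lxml, so it runs without extra dependencies. The names `LinkExtractor` and `extract_links` are made up for this example. With lxml, the equivalent would be roughly an XPath query such as `//a/@href` on the parsed post content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip named anchors and empty hrefs
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url=""):
    """Return the outbound links found in a post's HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Passing the post's own URL as `base_url` turns relative links into absolute ones, which makes later comparison easier.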

Mikko Ohtamaa