Given n raw URLs, I'd like to classify each one as news, blog, photo, or video.
For example, if a link points to a photo, is it enough to check that the raw URL ends in an image file extension in order to classify it as a photo?
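To make the extension idea concrete, here is a minimal Python sketch of the check I have in mind. The name `looks_like_photo` and the extension set are my own illustrative choices, not a complete list, and the check only looks at the URL path, so it would miss images served from extensionless URLs.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, non-exhaustive set of image extensions (an assumption for illustration).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def looks_like_photo(url: str) -> bool:
    """Return True if the URL path ends in one of the known image extensions."""
    # Use only the path component so query strings and fragments are ignored.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS
```

With this, `http://example.com/a/cat.JPG?size=large` would count as a photo, while a gallery page such as `http://example.com/gallery/123` would not, even if it shows nothing but an image.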
As for video, blog, and news, a fixed set of known domains (like http://www.youtube.com) doesn't seem to be enough to classify the raw URLs.
Could classification be done by examining the web content? Or are there any open source tools for this?
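As a rough idea of what examining the content could look like, here is a sketch that reads the Open Graph `og:type` meta tag from a page's HTML, assuming the page has already been downloaded and actually declares that tag (many pages don't). The class and function names and the mapping to my categories are assumptions of mine; note that `article` alone can't tell news from blog.

```python
from html.parser import HTMLParser
from typing import Optional


class OpenGraphTypeParser(HTMLParser):
    """Records the content of the first <meta property="og:type"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.og_type: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first og:type declaration is kept.
        if tag != "meta" or self.og_type is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:type":
            self.og_type = attributes.get("content")


def category_from_html(html: str) -> Optional[str]:
    """Map a page's og:type to a coarse category, or None if undecided."""
    parser = OpenGraphTypeParser()
    parser.feed(html)
    og_type = (parser.og_type or "").lower()
    # Hypothetical mapping: og:type values such as "video.other" start with "video".
    if og_type.startswith("video"):
        return "video"
    # "article" covers both news and blog posts, so it needs further signals.
    if og_type == "article":
        return "article"
    return None
```

A page that declares `og:type` as `video.other` would come back as `video`, and a page without the tag would come back as `None`, which is where I'd need some other signal.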