How to extract meaningful and useful content from web pages?

Question

I would like to parse a webpage and extract meaningful content from it. By meaningful, I mean the content (text only) that the user wants to see on that particular page (i.e. excluding ads, banners, comments, etc.). I want to ensure that when a user saves a page, the text they wanted to read is saved, and nothing else.

In short, I need to build an application which works just like Readability ( http://www.readability.com ). I need to take the useful content of the web page and store it in a separate file. I don't really know how to go about it.

I don't want to use APIs that require me to connect to the internet and fetch data from their servers, as the data extraction needs to be done offline.

There are two methods that I could think of:

1. Use a machine learning based algorithm (like this: http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/ )
2. Develop a web scraper that could satisfactorily remove all clutter from web pages.

Is there an existing tool that does this? I came across the boilerpipe library ( http://code.google.com/p/boilerpipe/ ) but haven't used it. Has anybody used it? Does it give satisfactory results? Are there any other tools, particularly ones written in PHP or Python, that do this kind of web scraping?

If I need to build my own tool to do this, how would you suggest I go about it?

Since I'd need to clean up messy or incomplete HTML before I begin parsing it, I'd use a tool like Tidy ( http://www.w3.org/People/Raggett/tidy/ ) or Beautiful Soup ( http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ) to do the job.

But I don't know how to extract content after this step.

PS. I am an amateur and would love it if there were ready-to-use open source tools that do this and can be easily integrated into the code I'll write in PHP or Python. Or, if I have to write my own code, I'd love to get guidance from anyone who has done such work before! :) Thanks a lot!

score 11 · Answer 1 · answered Dec 09 '12 at 20:46


Did you type 'python readability' into Google? There is a pretty popular (200+ followers) library on GitHub:

https://github.com/buriy/python-readability

Additionally, there is a PHP one if you search for 'php readability'. It has 100 followers, but it has not had any activity for almost two years: https://github.com/feelinglucky/php-readability

Finally, the most popular (350+ GitHub followers) is the Ruby readability port: https://github.com/iterationlabs/ruby-readability

At the very least you can see how these 3 different projects accomplish parsing the "important parts" of a webpage.
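To give a feel for the kind of heuristic such tools tend to rely on, here is a minimal sketch using only the Python standard library. It is not the algorithm of any of the three projects above. It simply splits the page into blocks, drops script and style content, and keeps blocks that are long enough and are not mostly link text. The names `BlockCollector` and `extract_main_text`, the tag sets, and the two thresholds are all assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real tools use far more elaborate rules.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6", "pre"}
SKIP_TAGS = {"script", "style", "noscript"}


class BlockCollector(HTMLParser):
    """Collects (text, link_chars) pairs, one per block-level chunk."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script/style/noscript content
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks with enough words and a low share of link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text
    kept = []
    for text, link_chars in parser.blocks:
        if len(text.split()) < min_words:
            continue  # too short: likely a menu item, caption, or label
        if link_chars / len(text) > max_link_density:
            continue  # mostly links: likely navigation or an ad
        kept.append(text)
    return "\n\n".join(kept)
```

Since it works on a plain string, it runs fully offline: read a saved page from disk (for example a hypothetical `page.html`), pass its contents to `extract_main_text`, and write the result to another file. Expect it to be much cruder than the libraries linked above.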


dm03514


Thanks a lot for the reply. As I said, I'm an amateur and I don't really know if this will work locally on my server, without internet access. If I give it an HTML document (saved on disk), will it be able to give me a 'clean' file back? Basically, is this an API to the Readability service (requiring access to Readability's servers) or is it self-sufficient code? Thanks! :) – user1271286 Dec 09 '12 at 21:11
@user1271286 these are libraries that don't require web requests. You can pass them HTML directly; with python-readability, for example: `readable_article = Document(html).summary()` and `readable_title = Document(html).short_title()`. Here `html` is just a string of HTML. – dm03514 Dec 09 '12 at 22:27
Thanks a lot for the help! :) Will work on it, and I'll post here how well it worked! – user1271286 Dec 11 '12 at 19:59
The most useful answer I have come across today. Thanks! – Harry May 29 '13 at 06:25

score 3 · Answer 2 · answered Dec 09 '12 at 22:32


You may use htql.

import htql
page="..."
query="&html_main_text"

result=htql.query(page, query)


seagulf


Thanks! Looks quite simple to use! :) Will try it out! – user1271286 Dec 11 '12 at 19:59
