Extract content from web articles and display it in a nice way

Question

I am trying to build something that lets people enter the URL of an article, for example one from The Verge. It would read the article at that URL and display it in a clean way, like Readability does. I am really stuck and can't find information anywhere on how to do it. Is there an API for this? Unlike an RSS reader, it would process a single article rather than a whole feed.

score 0 · Answer 1 · answered Sep 02 '12 at 09:54


This should be the easiest way: http://simplehtmldom.sourceforge.net/

You can target elements much as you would with CSS/jQuery selectors.
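To illustrate the idea of targeting elements rather than scanning raw text, here is a minimal sketch. It is not Simple HTML DOM (that is a PHP library); it swaps in Python's standard-library `html.parser` and assumes, purely for illustration, that the page wraps its story in an `<article>` element. The class name `ArticleParagraphs`, the helper `extract_paragraphs`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class ArticleParagraphs(HTMLParser):
    """Collect the text of every <p> found inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0      # > 0 while we are inside <article>
        self.in_paragraph = False   # True while we are inside a targeted <p>
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside inline tags such as <em> is still appended here.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found inside <article>."""
    parser = ArticleParagraphs()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]


# Hypothetical page: navigation and footer paragraphs should be ignored.
SAMPLE_HTML = """
<html><body>
<nav><p>Home | Reviews</p></nav>
<article>
  <h1>Example headline</h1>
  <p>First <em>paragraph</em> of the story.</p>
  <p>Second paragraph.</p>
</article>
<footer><p>Copyright notice</p></footer>
</body></html>
"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE_HTML):
        print(paragraph)
```

A real site may not use `<article>` at all, so the element you target would have to be chosen per site.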

answered Sep 02 '12 at 09:54

Wurstbro


score 0 · Answer 2 · answered Sep 02 '12 at 09:56


You can do this quick-and-dirty with regular expressions, or you can import the DOM. Note that a solution that works for one website is very unlikely to work for another without changes, whether you use regex or properly parse the DOM.
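As a rough sketch of the quick-and-dirty regex route, the snippet below pulls out text that is consistently surrounded by the same HTML. It assumes a hypothetical site that always wraps the story in `<div class="article-body">` with no nested `<div>` inside it; the pattern, the class name, and the helper `extract_body_text` are illustrative only and would need changing for any other site.

```python
import html
import re

# Assumption: the target site always uses exactly this wrapper, with no
# nested <div> inside it. Any markup change on the site breaks the pattern.
BODY_PATTERN = re.compile(r'<div class="article-body">(.*?)</div>', re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_body_text(page):
    """Return the article text as one string, or None if the wrapper is missing."""
    match = BODY_PATTERN.search(page)
    if match is None:
        return None
    text = TAG_PATTERN.sub("", match.group(1))   # drop the remaining tags
    text = html.unescape(text)                   # decode entities such as &amp;
    return " ".join(text.split())                # collapse whitespace


# Hypothetical page fragment used only to exercise the function.
SAMPLE_PAGE = (
    '<div class="sidebar">Ads</div>'
    '<div class="article-body"><p>Tom &amp; Jerry</p>\n<p>return.</p></div>'
)

if __name__ == "__main__":
    print(extract_body_text(SAMPLE_PAGE))
```

This only works because the surrounding markup is fixed; it is not a general HTML parser.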

answered Sep 02 '12 at 09:56

Eliot Ball


Welcome to Stack Overflow. Prepare for endless flaming for daring to mention [parsing HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Jezen Thomas Sep 02 '12 at 09:59
I'm not claiming that one can parse HTML with regex, as I know this to be false. Merely I am stating that one can pull out snippets of text that are consistently surrounded by the same HTML using regex. – Eliot Ball Sep 02 '12 at 10:01
My comment was *totally* tongue-in-cheek :) – Jezen Thomas Sep 02 '12 at 10:02

stuckintheshuck · Accepted Answer · 2012-09-05T22:18:24.310


You are looking for boilerpipe. It should do exactly what you want. There is even a web API. You can also download the module and use it from a Python script.

You can test it out on an article of your choice here: http://boilerpipe-web.appspot.com. Just select ArticleExtractor as the extractor.

edited Sep 05 '12 at 22:18

answered Sep 05 '12 at 15:47

stuckintheshuck

