I am trying to build something that lets people enter the URL of an article, for example one from The Verge. It should read the article at that URL and display it in a clean, readable way, like Readability does. But I am really stuck: I can't find information anywhere on how to do this. Is there an API for it? In effect, I want to process a single article rather than scan a whole RSS feed.
Viewed 1,432 times
3 Answers
0
The easiest way is probably Simple HTML DOM (a PHP library): http://simplehtmldom.sourceforge.net/
It lets you target elements with selectors, much like CSS or jQuery.
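To illustrate the idea of targeting elements in a parsed page, here is a minimal sketch in Python using only the standard library `html.parser` module, not Simple HTML DOM itself. It assumes the article text lives in `<p>` elements inside an `<article>` element, which is an assumption about the page and will not hold for every site. The `ArticleParagraphs` class, the `extract_paragraphs` function, and the `sample` markup are all made up for the example.

```python
from html.parser import HTMLParser


class ArticleParagraphs(HTMLParser):
    """Collect the text of every <p> nested inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0    # greater than 0 while inside <article>
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ArticleParagraphs()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up page: navigation and footer paragraphs should be ignored.
sample = """
<html><body>
<nav><p>Home | Tech | Reviews</p></nav>
<article>
  <h1>Example headline</h1>
  <p>First paragraph of the <em>story</em>.</p>
  <p>Second paragraph.</p>
</article>
<footer><p>Copyright notice</p></footer>
</body></html>
"""

print(extract_paragraphs(sample))
# ['First paragraph of the story.', 'Second paragraph.']
```

A real page would first have to be downloaded (for example with `urllib.request`), and the tag names to look for would need adjusting per site.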

Wurstbro
0
You can do this quick-and-dirty with regular expressions, or you can parse the page into a DOM. Note that a solution that works for one website is very unlikely to work unchanged for another, whether you use regex or properly parse the DOM.
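A rough sketch of the quick-and-dirty regex route, in Python, might look like the following. The site names, the wrapper markup in `SITE_PATTERNS`, and the `extract_snippet` helper are all hypothetical; the point is only that each site needs its own pattern, and that the pattern works only while the site keeps wrapping its article body in exactly that markup.

```python
import re

# Hypothetical per-site patterns. Each one assumes the article body is always
# wrapped in the same markup on that site, and that the wrapper contains no
# nested element of the same type (a nested </div> would end the match early).
SITE_PATTERNS = {
    "example-news.test": re.compile(
        r'<div class="article-body">(.*?)</div>', re.DOTALL
    ),
    "example-blog.test": re.compile(
        r'<section id="post">(.*?)</section>', re.DOTALL
    ),
}

TAG = re.compile(r"<[^>]+>")


def extract_snippet(site, html):
    """Return the text inside the site's known wrapper, or None if not found."""
    pattern = SITE_PATTERNS.get(site)
    if pattern is None:
        return None          # no pattern written for this site yet
    match = pattern.search(html)
    if match is None:
        return None          # the markup did not look the way we assumed
    text = TAG.sub(" ", match.group(1))   # crude tag stripping
    return " ".join(text.split())


news_page = (
    '<div class="nav">Menu</div>'
    '<div class="article-body"><p>Body of the story.</p><p>More text.</p></div>'
)

print(extract_snippet("example-news.test", news_page))  # Body of the story. More text.
print(extract_snippet("example-blog.test", news_page))  # None
```

The second call returns `None` because the blog pattern does not match the news markup, which is the "works for one website, not for another" problem in miniature.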

Eliot Ball
- Welcome to Stack Overflow. Prepare for endless flaming for daring to mention [parsing HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Jezen Thomas Sep 02 '12 at 09:59
- I'm not claiming that one can parse HTML with regex, as I know this to be false. I am merely stating that one can pull out snippets of text that are consistently surrounded by the same HTML using regex. – Eliot Ball Sep 02 '12 at 10:01
- My comment was *totally* tongue-in-cheek :) – Jezen Thomas Sep 02 '12 at 10:02
0
You are looking for boilerpipe. It should do exactly what you want. There is even a web API. You can also download the module and use it from a Python script.
You can test it on an article of your choice at http://boilerpipe-web.appspot.com; just select ArticleExtractor as the extractor.
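To give a feel for what a boilerplate remover does, here is a deliberately simplified sketch in Python using only the standard library. It is not boilerpipe's algorithm or API; it is a swapped-in heuristic that keeps text blocks which are reasonably long and contain few words inside links. The `BlockCollector` class, the `extract_article` function, the tag sets, and the thresholds (`min_words=8`, `max_link_density=0.3`) are all assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries, and tags whose content is ignored.
BLOCK_TAGS = {"p", "div", "li", "td", "section", "article", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count link words in each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks: (text, link_word_count)
        self._text = []
        self._link_words = 0
        self._link_depth = 0
        self._skip_depth = 0

    def _flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_words))
        self._text = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "a" and self._link_depth > 0:
            self._link_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._link_depth:
            self._link_words += len(data.split())

    def close(self):
        super().close()
        self._flush()


def extract_article(html, min_words=8, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, link_words in collector.blocks:
        total = len(text.split())
        if total >= min_words and link_words / total <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)


# Made-up page with a menu, two story paragraphs, and a footer.
page = """
<div id="menu"><a href="/">Home</a> <a href="/tech">Tech</a> <a href="/reviews">Reviews</a></div>
<div id="story">
<p>The new phone ships next month and costs less than last year's model, the company said on Tuesday.</p>
<p>Analysts expect strong demand, although supply of the main chip remains limited, according to <a href="/src">one report</a>.</p>
</div>
<div id="footer"><a href="/about">About us</a> | <a href="/terms">Terms of service</a></div>
"""

print(extract_article(page))
```

On this sample the menu and footer are dropped because they are short and consist almost entirely of link text, while both story paragraphs are kept. Real pages are messier, which is why a dedicated tool such as boilerpipe is the better choice in practice.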

stuckintheshuck