Best way to extract text (e.g. articles) from web page

Question

I am trying to write a program that collects certain information from different articles and combines it. The step I am having trouble with is extracting the article from the web page.

Could you suggest any Java libraries or methods for extracting text from a web page?

I have also found this product: http://www.diffbot.com/products/automatic/article/ and was wondering whether you think this is the way to go. If so, can someone point me to a Java implementation? I cannot seem to find one, although apparently one exists.

Many thanks

Clarification: I am looking more for an algorithm/library/method for detecting where in an HTML DOM tree a block of text that could be an article is located, like Safari's Reader function. P.S. If you think this is much easier in something like Python, just say so. My program has to run in Java, since it should eventually run on a server (using a Java framework), but I could have it call Python scripts. I would only do that if you advise that Python is the way to go.

I think what you're looking for is a web scraper, take a look at this question (and answer): http://stackoverflow.com/questions/3202305/web-scraping-with-java — Mekswoll, Dec 24 '13 at 23:33
The new Instapaper API might be a great choice for many now: https://www.instapaper.com/api — Jakub Kotowski, Apr 24 '16 at 17:25

score 3 · Answer 1 · answered Dec 25 '13 at 00:17

Have a look at Apache Tika. It's meant to be used together with a crawler and can extract both text and metadata for you. You can also select various output types.

Jakub Kotowski


score 3 · Answer 2 · answered Dec 25 '13 at 00:51

I have found an open source solution that is very highly rated: https://code.google.com/p/boilerpipe/

A review on different text extraction algorithms: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/

It appears that Diffbot performs very well but is not open source. So among open source options, boilerpipe could be the way to go.
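To give a feel for what "detecting where the article is" can mean, here is a minimal sketch of a block-level heuristic using only the JDK. It is not boilerpipe's API or its actual algorithm; it only illustrates the general idea of shallow text features (word count and link density per block). The class name `ArticleBlockSketch`, the method names, and the thresholds are all made up for illustration, and the regex-based splitting is a simplification that ignores HTML entities and malformed markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: keep blocks with many words and few linked words.
class ArticleBlockSketch {

    // Tags treated as block boundaries (an assumption, not a complete list).
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(p|div|td|li|h[1-6]|br|article|section|nav|header|footer|ul|ol|table|tr)\\b[^>]*>");
    // Script and style elements, including their contents.
    private static final Pattern SCRIPT_STYLE = Pattern.compile(
        "(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Anchor elements; group 1 is the link text.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Removes all tags and collapses whitespace.
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Fraction of a block's words that sit inside <a> elements.
    static double linkDensity(String blockHtml) {
        int total = countWords(stripTags(blockHtml));
        if (total == 0) {
            return 0.0;
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linked += countWords(stripTags(m.group(1)));
        }
        return (double) linked / total;
    }

    // Returns the text of blocks that look like article content.
    static List<String> extractContentBlocks(String html, int minWords, double maxLinkDensity) {
        String cleaned = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(cleaned)) {
            String text = stripTags(block);
            if (countWords(text) >= minWords && linkDensity(block) <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<div><a href='/'>Home</a> <a href='/news'>News</a></div>"
            + "<p>A long enough paragraph of ordinary article text that has almost no links in it at all.</p>";
        for (String block : extractContentBlocks(html, 10, 0.3)) {
            System.out.println(block);
        }
    }
}
```

The thresholds (10 words, 30% linked words) are arbitrary starting points. A real extractor would work on a parsed DOM and use more features, which is where a library such as boilerpipe earns its keep.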

Saad Attieh


The given link is dead. The new link is http://tomazkovacic.com/blog/2011/06/09/evaluating-text-extraction-algorithms/ – Onur Uslu Aug 10 '22 at 20:34

score -1 · Answer 3 · answered Dec 24 '13 at 23:41

This will not handle every piece of malformed HTML you can get, but most of the time JTidy does a good job of cleaning the HTML and giving you an interface for accessing the various DOM nodes, and through them the text inside those nodes.
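Once you have a DOM, collecting the text is a simple recursive walk. The sketch below does not call JTidy itself; it assumes the input is already well-formed XHTML and uses the JDK's built-in parser to get an `org.w3c.dom.Document`, so it runs without extra dependencies. JTidy's `Tidy.parseDOM` should hand you a document of the same interface type, so the walk ought to carry over, but treat that as an assumption to check against the JTidy docs. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: gather visible text from a W3C DOM tree.
class DomTextWalkSketch {

    // Appends the text of node and its descendants, skipping script/style.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Parses well-formed XHTML with the JDK parser and returns its text.
    static String textOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Note that this returns all the text on the page, navigation and footers included. Deciding which part is the article is a separate step.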

lwi

