8

I am trying to write a program that collects certain information from different articles and combines it. The step I am having trouble with is extracting the article from the web page.

Could you suggest any Java libraries or methods for extracting text from a web page?

I have also found this product: http://www.diffbot.com/products/automatic/article/ and was wondering whether you think this is the way to go. If so, can someone point me to a Java implementation? I cannot seem to find one, although apparently one exists.

Many thanks

Clarification: I am looking for an algorithm/library/method for detecting where in an HTML DOM tree a block of text that could be an article is located, like Safari's Reader function. P.S. If you think this is much easier in something like Python, just say so. My program has to run in Java, as it should eventually run on a server (using a Java framework), but I could have it call Python scripts. I would only do this if you advise that Python is the way to go.

Saad Attieh
  • I think what you're looking for is a web scraper, take a look at this question (and answer): http://stackoverflow.com/questions/3202305/web-scraping-with-java – Mekswoll Dec 24 '13 at 23:33
  • The new Instapaper API might be a great choice for many now: https://www.instapaper.com/api – Jakub Kotowski Apr 24 '16 at 17:25

3 Answers

3

Have a look at Apache Tika. It's meant to be used together with a crawler and can extract both text and metadata for you. You can also select various output types.

Jakub Kotowski
3

I found an open-source solution that is very highly rated, boilerpipe: https://code.google.com/p/boilerpipe/

A review of different text extraction algorithms: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/

Diffbot appears to perform very well but is not open source, so among open-source options boilerpipe could be the way to go.
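To give a feel for what this family of extractors does, here is a minimal sketch of a "link density" heuristic using only the JDK. It is not boilerpipe's actual algorithm or API; the class name `ArticleBlockSketch`, the method names and the thresholds are all made up for illustration. It splits the markup on block-level tags with a regex (a real extractor would work on a parsed DOM), keeps blocks that have enough words, and drops blocks where most of the words sit inside links, as is typical for menus and "read more" teasers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of boilerpipe or any library.
class ArticleBlockSketch {
    // Crude block splitter: cuts the markup at common block-level tags.
    private static final Pattern BLOCK_SPLIT = Pattern.compile(
        "(?i)</?(?:p|div|td|li|h[1-6]|ul|ol|table|tr|section|article|nav|header|footer|aside|br)\\b[^>]*>");
    // Captures the inner markup of each <a>...</a> element.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    // Removes tags and collapses whitespace. Entities such as &amp; are not decoded.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    static int wordCount(String text) {
        return text.isEmpty() ? 0 : text.split(" ").length;
    }

    // Keeps blocks with at least minWords words whose share of linked words
    // is at most maxLinkDensity (both thresholds are assumptions to tune).
    static List<String> extractContentBlocks(String html, int minWords, double maxLinkDensity) {
        List<String> kept = new ArrayList<>();
        for (String rawBlock : BLOCK_SPLIT.split(html)) {
            String text = stripTags(rawBlock);
            int words = wordCount(text);
            if (words < minWords) {
                continue; // too short to be article text
            }
            int linkedWords = 0;
            Matcher anchors = ANCHOR.matcher(rawBlock);
            while (anchors.find()) {
                linkedWords += wordCount(stripTags(anchors.group(1)));
            }
            double linkDensity = (double) linkedWords / words;
            if (linkDensity <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<div><ul><li><a href='/'>Home</a></li></ul>"
            + "<p>Text extraction libraries try to separate the main article body "
            + "from navigation menus, adverts and footers on a page.</p></div>";
        for (String block : extractContentBlocks(html, 5, 0.33)) {
            System.out.println(block);
        }
    }
}
```

This sketch ignores `<script>` and `<style>` content and many other real-world cases, so treat it as an explanation of the idea rather than a replacement for a library.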

Saad Attieh
  • The given link is dead. The new link is http://tomazkovacic.com/blog/2011/06/09/evaluating-text-extraction-algorithms/ – Onur Uslu Aug 10 '22 at 20:34
-1

This won't handle every piece of malformed HTML you might encounter, but most of the time JTidy does a good job of cleaning up the HTML and giving you an interface for accessing the DOM nodes, and through them the text inside those nodes.
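As a rough illustration of the second half (getting at the text once you have a DOM), here is a sketch that walks a standard `org.w3c.dom` tree and collects the text nodes. It does not call JTidy at all: it assumes the markup has already been cleaned into well-formed XML and parses it with the JDK's own parser. The class name `DomTextWalker` and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; it assumes already-cleaned, well-formed markup.
class DomTextWalker {
    // Returns the visible text below the given node with whitespace collapsed.
    static String textOf(Node node) {
        StringBuilder out = new StringBuilder();
        collect(node, out);
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // A space follows each text node so adjacent blocks do not run together;
            // this can leave a space before punctuation that follows an inline tag.
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName().toLowerCase();
            if (name.equals("script") || name.equals("style")) {
                return; // not human-readable content
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Parses well-formed markup with the JDK parser; raw real-world HTML will usually fail here.
    static Document parseWellFormed(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
            "<html><body><p>First paragraph.</p><p>Second one.</p></body></html>");
        System.out.println(textOf(doc));
    }
}
```

Note that this only flattens a tree to text. Deciding which subtree is the article still needs a heuristic or a dedicated extraction library.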

lwi