
I would like to extract the text of the article at a given URL.

Do you know of a library or existing code that is able to do that?

Here is an example URL: http://fr.news.yahoo.com/france-foot-pro-vote-gr%C3%A8ve-fin-novembre-contre-125358890.html

Thanks

Regards

wawanopoulos
    http://stackoverflow.com/questions/3036638/how-to-extract-web-page-textual-content-in-java – Kakalokia Oct 24 '13 at 14:07
  • Just to save some people time: https://github.com/milosmns/goose – Goose for Android extracts text and other info; see the dev page for more information. – milosmns Sep 05 '16 at 08:42

2 Answers


You need to use Jsoup. Its uses are:

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe white-list, to prevent XSS attacks
  • output tidy HTML

The site also has a simple getting-started example, but here is an SSCCE from Mkyong:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLParserExample1 {

  public static void main(String[] args) {

    Document doc;
    try {

        // need http protocol
        doc = Jsoup.connect("http://google.com").get();

        // get page title
        String title = doc.title();
        System.out.println("title : " + title);

        // get all links
        Elements links = doc.select("a[href]");
        for (Element link : links) {

            // get the value from href attribute
            System.out.println("\nlink : " + link.attr("href"));
            System.out.println("text : " + link.text());

        }

    } catch (IOException e) {
        e.printStackTrace();
    }

  }

}  

Website: http://jsoup.org/
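
Since the question is specifically about pulling out the article text, here is a minimal sketch of the same jsoup approach adapted to that. Note that the `article p` selector is an assumption; you would need to inspect the page's actual markup and adjust it:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ArticleTextExample {

  public static void main(String[] args) throws IOException {

      // fetch and parse the page in one step
      Document doc = Jsoup.connect("http://fr.news.yahoo.com/france-foot-pro-vote-gr%C3%A8ve-fin-novembre-contre-125358890.html").get();

      // "article p" is a guess at where the story text lives --
      // inspect the page and adjust the selector to its real markup
      String articleText = doc.select("article p").text();
      System.out.println(articleText);
  }
}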

An SO User

I particularly like using the Apache HttpClient library. You can create HTTP requests pretty easily and parse the results however you need to. Here's a very bare-bones example using your URL (but no parsing).

import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.ParseException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;


public class Test {

    public static void main(String[] args) throws ParseException, IOException {
        DefaultHttpClient httpclient = new DefaultHttpClient();

        // issue a GET request for the article page
        HttpGet httpget = new HttpGet("http://fr.news.yahoo.com/france-foot-pro-vote-gr%C3%A8ve-fin-novembre-contre-125358890.html");
        HttpResponse response = httpclient.execute(httpget);

        // read the response body into a string, then release the
        // underlying connection resources
        String responseText = EntityUtils.toString(response.getEntity());
        EntityUtils.consumeQuietly(response.getEntity());

        System.out.println(responseText);
    }

}
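
If you go this route, you can hand the fetched HTML to a parser afterwards. Below is a minimal sketch feeding the response into jsoup; Jsoup.parse(String, String) is standard jsoup API, while the `article p` selector is an assumption about the page's markup:

import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TestWithParsing {

    public static void main(String[] args) throws IOException {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        HttpGet httpget = new HttpGet("http://fr.news.yahoo.com/france-foot-pro-vote-gr%C3%A8ve-fin-novembre-contre-125358890.html");
        HttpResponse response = httpclient.execute(httpget);
        String responseText = EntityUtils.toString(response.getEntity());

        // parse the raw HTML; the second argument is the base URI used
        // to resolve relative links
        Document doc = Jsoup.parse(responseText, "http://fr.news.yahoo.com/");

        // "article p" is a guess at the story's markup -- adjust as needed
        System.out.println(doc.select("article p").text());
    }
}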
Chill
  • but `JSoup` is better. It is very tasty and good for health, you know? :D – An SO User Oct 24 '13 at 14:20
  • I generally use Apache for JSON web services, so in this case it probably isn't the easiest. I mostly prefer it out of familiarity, I think. – Chill Oct 24 '13 at 14:34