6

I am looking for a way to extract the text from a web page (initially HTML) using the JDK or another library. Please help.

Thanks

Radi

3 Answers

14

Use jsoup. This is currently the most elegant library for screen scraping.

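// imports: java.net.URL, org.jsoup.Jsoup, org.jsoup.nodes.Document
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3*1000); // fetch and parse, 3-second timeout
String title = doc.title();              // contents of the <title> element
String text = doc.body().text();         // text of the body with tags removed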

I just love its CSS selector syntax.

Pascal Thivent
13

Use an HTML parser if at all possible; there are many available for Java.
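
As one illustration of the parser route, the JDK itself ships a lenient HTML parser in the Swing packages, so no extra dependency is needed. The following is only a minimal sketch: the class name `TextExtractor` and the method `extractText` are made up for this example, and the Swing parser targets old HTML versions, so a dedicated library will likely cope better with modern pages.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TextExtractor {  // hypothetical name, for illustration only

    // Collects every text node reported by the JDK's built-in HTML parser.
    public static String extractText(Reader html) throws IOException {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate text nodes with a space so words do not run together.
                sb.append(data).append(' ');
            }
        };
        // The last argument tells the parser to ignore charset declarations,
        // since the Reader has already decoded the characters.
        new ParserDelegator().parse(html, callback, true);
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Hello</h1><p>World</p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

To run it against a live page, wrap the connection stream in an `InputStreamReader` with the page's charset and pass that reader to `extractText`.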

Alternatively, you can use regular expressions, as many people do. This is generally not advisable, however, unless the processing is very simple.
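
For the "very simple" case, a regex-based tag stripper might look like the sketch below. The class `TagStripper` and the method `stripTags` are hypothetical names chosen for this example. The pattern is deliberately naive: it will misbehave on a `>` inside an attribute value, on comments, and on script or style content, and it does not decode entities such as `&amp;`, which is why a real parser is the safer choice.

```java
public class TagStripper {  // hypothetical name, for illustration only

    // Replaces anything that looks like a tag with a space,
    // then collapses runs of whitespace into single spaces.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```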

polygenelubricants
3

Here's a short method that fetches a page's raw HTML source as a string (based on java.util.Scanner):

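// imports: java.net.URL, java.util.Scanner
public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the scanner and the underlying stream;
   // UTF-8 is an assumption, use the page's actual charset if it differs
   try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
      while (sc.hasNextLine())   // pair hasNextLine() with nextLine()
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}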

And this is how it is used:

public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}
Itay Maman