I am looking for a way to extract text from a web page (initially HTML) using the JDK or another library. Please help.
Thanks.
Use jsoup. This is currently the most elegant library for screen scraping.
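import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3 * 1000); // fetch and parse the page, 3-second timeout
String title = doc.title();                // contents of the <title> element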
I just love its CSS selector syntax.
Use an HTML parser if at all possible; there are many available for Java.
Or you can use regex like many people do. This is generally not advisable, however, unless you're doing very simple processing.
Text extraction:
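As one possible illustration of the parser route using only the JDK, here is a minimal sketch built on the HTML parser that ships with Swing (javax.swing.text.html). The class name HtmlTextExtractor and its extract method are made up for this example, and the sample HTML is a placeholder. Treat it as a starting point under those assumptions, not a complete solution: the Swing parser is dated and may handle modern markup poorly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name and design are illustrative only.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data it reports.
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    // Parses the given HTML string and returns the text collected by handleText.
    static String extract(String html) {
        HtmlTextExtractor callback = new HtmlTextExtractor();
        try {
            // 'true' tells the parser to ignore charset declarations in the document;
            // the input here is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extract(html));
    }
}
```

The callback simply concatenates every text run with a space in between, so the exact spacing of the result may differ from what a browser would render.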
Tag stripping:
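For very simple input, the regex approach mentioned above could look roughly like the sketch below. The class name TagStripper and the patterns are assumptions made for illustration. It does not decode entities such as &amp;, does not skip script or style contents, and breaks on a '>' inside an attribute value or comment, which is why a real parser is usually the better choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name and patterns are illustrative only.
class TagStripper {
    // Anything that looks like a tag: '<', a run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Any run of whitespace, used to tidy up after the tags are removed.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Replace each tag with a space so adjacent blocks do not run together.
        String noTags = TAG.matcher(html).replaceAll(" ");
        // Collapse whitespace runs and trim the ends.
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints "Hello world"
    }
}
```

Replacing tags with a space rather than an empty string keeps words in neighbouring elements apart, at the cost of occasionally inserting a space before punctuation that directly follows a closing tag.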
Here's a short method that nicely wraps the details of downloading the page's raw HTML (based on java.util.Scanner):
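import java.net.URL;
import java.util.Scanner;
public static String get(String url) throws Exception {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the Scanner and the underlying stream; UTF-8 is assumed
    try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
        while (sc.hasNextLine())
            sb.append(sc.nextLine()).append('\n');
    }
    return sb.toString();
}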
And this is how it is used:
public static void main(String[] args) throws Exception {
    System.out.println(get("http://www.yahoo.com"));
}