How to extract web page textual content in Java?

Question

I am looking for a way to extract the text from a web page (initially HTML) using the JDK or another library. Please help.

Thanks.

The best way is to use jsoup (Gradle: compile 'org.jsoup:jsoup:1.9.2'). – VahidHoseini Sep 26 '16 at 18:58

score 14 · Answer 1 · answered Jun 14 '10 at 11:12

Use jsoup. This is currently the most elegant library for screen scraping.

URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3 * 1000); // fetch and parse; the timeout is in milliseconds
String title = doc.title();

I just love its CSS selector syntax.
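
Since the question asks for the page text rather than the title, here is a minimal sketch of how that might look with jsoup. It parses an inline HTML string instead of fetching a URL so that it runs offline; the class name JsoupTextSketch, the helper extractText, and the sample markup are made up for illustration, and the jsoup jar is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup; needs the jsoup jar on the classpath.
class JsoupTextSketch {

    // Returns the text content of the <body>, with all markup removed.
    static String extractText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Sample markup, invented for this sketch.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><h1>Hello</h1><p>Some <b>bold</b> text. "
                + "<a href=\"http://example.com/\">A link</a></p></body></html>";

        System.out.println(extractText(html));

        // CSS selector: every <a> element that carries an href attribute.
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```

For a live page, the same calls should work on the Document obtained from a URL as in the snippet above.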

Pascal Thivent

Love jsoup, but it doesn't execute the page's JavaScript. For JavaScript-rendered pages I use Selenium. – Angsuman Chakraborty Dec 24 '21 at 11:16

score 13 · Accepted Answer · edited May 23 '17 at 12:00

Use an HTML parser if at all possible; there are many available for Java.

Or you can use regex like many people do. This is generally not advisable, however, unless you're doing very simplistic processing.
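
To illustrate why regex only suits very simple input, here is a rough sketch of naive tag stripping using nothing but the JDK. The class name NaiveTagStripper and the sample strings are made up for this example; it is not a recommended implementation, just a demonstration of where the approach starts to break.

```java
// Hypothetical helper class, for illustration only.
class NaiveTagStripper {

    // Deletes anything that looks like a tag: '<', a run of non-'>' characters, then '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup: behaves as hoped.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));

        // Script bodies are not tags, so the code leaks into the result.
        System.out.println(stripTags("<script>var x = 1;</script><p>Hi</p>"));

        // Character entities are left undecoded.
        System.out.println(stripTags("<p>Fish &amp; chips</p>"));
    }
}
```

The last two cases (script contents and entities such as &amp;amp;) are the kind of thing an HTML parser handles for you.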

score 3 · Answer 3 · answered Jun 14 '10 at 11:13

Here's a short method, based on java.util.Scanner, that downloads a page. Note that it returns the raw HTML, markup included, so the text still has to be extracted from the result:

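import java.net.URL;
import java.util.Scanner;

public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the stream; UTF-8 is an assumption about the page
   try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
      while (sc.hasNextLine())
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}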

And this is how it is used:

public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}
