How to read a text from a web page with Java?

Question

I want to read the text from a web page. I don't want to get the web page's HTML code. I found this code:

    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");       

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            str = in.readLine().toString();
            System.out.println(str);
            // str is one line of text; readLine() strips the newline character(s)
        }
        in.close();
    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }

but this code gives me the HTML code of the web page. I want to get the whole text inside this page. How can I do this with Java?

Just parse the text from the HTML tags. From there you can find the info you want and extract it from there. — , Mar 22 '12 at 15:49
If you are looking for HTML to DOM http://stackoverflow.com/questions/457684/reading-html-file-to-dom-tree-using-java can help you. — Jaydeep Patel, Mar 22 '12 at 16:06
FYI - You are calling in.readLine() twice per iteration, so you actually are skipping every odd line. (Just thought I should point out the bug in this code since it is one of the first results for a google search on reading webpages with Java.) — JPProgrammer, Nov 07 '13 at 04:54
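
The double `readLine()` call noted in the last comment can be avoided by reading exactly once per iteration. This is a minimal sketch, not the asker's final code. The class `LineReaderSketch` and the method `readAllLines` are hypothetical names, and a `StringReader` stands in for `new InputStreamReader(url.openStream())` so that it runs offline. It still returns the raw HTML lines, not the visible text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineReaderSketch {

    // Hypothetical helper: collects every line, calling readLine() once per iteration.
    public static List<String> readAllLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line); // use the line already read; do not call readLine() again
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        try {
            // Stand-in input; for a real page pass new InputStreamReader(url.openStream()).
            Reader source = new StringReader("<p>first</p>\n<p>second</p>\n<p>third</p>");
            for (String line : readAllLines(source)) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace(); // report the failure instead of swallowing it
        }
    }
}
```

With the original loop only every second line would be printed; here all three lines are.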

score 18 · Accepted Answer · answered Mar 22 '12 at 15:59

You may want to have a look at jsoup for this:

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is an extract from one on their site.

score 4 · Answer 2 · answered Mar 22 '12 at 15:59

Use jsoup.

You will be able to parse the content using CSS-style selectors.

For the page in the question you can try:

    Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
    String textContents = doc.select(".newsText").first().text();

score 0 · Answer 3 · edited Apr 30 '15 at 13:55

You can also use the HtmlCleaner jar. The code is below.

    // url is the java.net.URL of the page, as in the question
    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode node = cleaner.clean( url );

    System.out.println( node.getText().toString() );

edited Apr 30 '15 at 13:55 by user2988879 · answered May 07 '13 at 08:59 by Prabuddha

score 0 · Answer 4 · answered Jan 11 '22 at 07:52

    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }

Add at least e.printStackTrace() to these empty catch blocks. It will save you many days of your life.

answered Jan 11 '22 at 07:52 by Lukasz Ronikier

score 0 · Answer 5 · answered Mar 22 '12 at 15:51

You would have to take the content you get with your current code, then parse it and look for the tags that contain the text you want. A SAX parser is well suited to this job.

Or, if you do not want a particular piece of text, simply remove all tags so that you are left with only the text. I guess you could use a regular expression for that.
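
The "remove all tags" idea could look roughly like this. It is a rough sketch using only `java.util.regex`; the class `TagStripSketch` and the method `stripTags` are hypothetical names, and the patterns are an assumption about what "all tags" should cover. Regular expressions do not really parse HTML: this does not decode entities such as `&amp;`, it inserts a space wherever a tag was removed, and it can be fooled by a `>` inside an attribute value, so a real parser such as jsoup is usually the safer choice.

```java
import java.util.regex.Pattern;

public class TagStripSketch {

    // Drops <script> and <style> elements together with their contents.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Matches any remaining tag, e.g. <p>, </a>, <br/>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    // Hypothetical helper: returns the text left after removing tags.
    public static String stripTags(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        // Collapse the whitespace left behind by the removed tags.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(stripTags(html));
    }
}
```

To apply it to the question's code, append each line read from the `BufferedReader` to a `StringBuilder` and pass the resulting string to `stripTags`.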
