13

I want to read the text from a web page. I don't want to get the web page's HTML code. I found this code:

    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");       

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            str = in.readLine().toString();
            System.out.println(str);
            // str is one line of text; readLine() strips the newline character(s)
        }
        in.close();
    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }

But this code gives me the HTML code of the web page. I want to get all of the text inside this page, without the markup. How can I do this with Java?

Rigor Mortis
  • Just parse the text from the HTML tags. From there you can find the info you want and extract it from there. – Mar 22 '12 at 15:49
  • If you are looking for HTML to DOM http://stackoverflow.com/questions/457684/reading-html-file-to-dom-tree-using-java can help you. – Jaydeep Patel Mar 22 '12 at 16:06
  • FYI - You are calling in.readLine() twice per iteration, so you actually are skipping every odd line. (Just thought I should point out the bug in this code since it is one of the first results for a google search on reading webpages with Java.) – JPProgrammer Nov 07 '13 at 04:54

5 Answers

18

You may want to have a look at jsoup for this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is taken from one on the jsoup site.

Fabian Barney
4

Use jsoup.

It lets you pick out parts of the parsed page using CSS-style selectors.

For the page in your question, you can try:

    Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
    // first() returns null when nothing matches ".newsText", so check for null in real code
    String textContents = doc.select(".newsText").first().text();
Nitzan Volman
0

You can also use the HtmlCleaner library. The code is below.

    // url must be a java.net.URL here; clean(String) would treat its argument as HTML source
    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode node = cleaner.clean(url);

    System.out.println(node.getText().toString());
user2988879
0
    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }

Add at least e.printStackTrace() to these empty catch blocks. It will save you many days of your life.
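As a rough sketch of that advice, the question's loop could be rewritten along the following lines. The class name PageReader, the helper readAll, the fallback address http://example.com/ and the UTF-8 charset are all assumptions made for illustration; none of them come from the original post. The sketch also calls readLine() only once per iteration, since the loop in the question calls it twice and drops lines as a result.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageReader {

    // Reads everything from the given reader, one line at a time.
    // readLine() is called exactly once per iteration, so no line is skipped.
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical default address; pass the real one as the first argument.
        String address = args.length > 0 ? args[0] : "http://example.com/";
        try {
            URL url = new URL(address);
            // Assumes the page is UTF-8 encoded; a real client should honour the declared charset.
            System.out.print(readAll(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)));
        } catch (IOException e) {
            // MalformedURLException is a subclass of IOException, so both cases end up here.
            e.printStackTrace();
        }
    }
}
```

This still prints the raw HTML, just like the original code; it only makes failures visible and stops lines from being dropped.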

0

You would have to take the content you get with your current code, then parse it and look for the tags that contain the text you want. A SAX parser would be well suited for this job.

Or, if you are not after a particular piece of text, simply remove all tags so that you're left with only the text. I guess you could use a regular expression for that.
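To make the tag-removal idea concrete, here is a minimal sketch using java.util.regex. The class name TagStripper and the method stripTags are made up for illustration. It is a crude approximation and not a real HTML parser: it does not decode entities such as &amp;amp;, it can be confused by a ">" inside a comment or attribute value, and because each tag is replaced by a space it will split a word that has a tag in the middle.

```java
import java.util.regex.Pattern;

class TagStripper {

    // <script> and <style> elements are removed together with their contents,
    // because that content is code, not readable text.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Anything else that looks like a tag: "<", then any run of non-">" characters, then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        // Collapse the runs of whitespace left behind by the removed tags.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(stripTags(html)); // prints: An example link.
    }
}
```

For anything beyond a quick one-off, a real HTML parser is likely to be the more robust choice.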

Paaske