How to Parse Only Text from HTML

Question

How can I parse only the text from a web page using jsoup in Java?

score 19 · Accepted Answer · answered Aug 17 '10 at 22:13

From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."

Ryan Berger

how to exclude invisible elements? (e.g. display: none) – Ehsan Jun 19 '13 at 06:51

score 2 · Answer 2 · answered Aug 17 '10 at 23:14

Using classes that are part of the JDK:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // Ignore charset directives in the markup; otherwise the
        // parser may throw a ChangedCharSetException.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content.

        Reader rd = getReader(args[0]);

        // Parse the HTML.

        kit.read(rd, doc, 0);

        // The HTML text is now stored in the document.

        System.out.println( doc.getText(0, doc.getLength()) );
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:" or "https:", it's treated as a URL;
    // otherwise, it's assumed to be a local filename.

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
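
For a quick check without a network connection or a local file, the JDK parser can also be driven directly from a string. The following is only a minimal sketch and not part of the original answer: the class name SwingTextExtractor and the method extractText are hypothetical, and it collects text through HTMLEditorKit.ParserCallback and ParserDelegator instead of building a Document.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not taken from the original answer.
class SwingTextExtractor {

    // Collects the text chunks reported by the JDK's HTML parser.
    static String extractText(String html) {
        StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space (a simplification).
                sb.append(data).append(' ');
            }
        };
        try {
            // 'true' asks the parser to ignore charset directives in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        // Collapse the whitespace introduced by the chunk separator.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(extractText(html));
    }
}
```

Joining chunks with a space keeps the sketch short, but it can leave a stray space before punctuation that directly follows a closing tag.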

score 0 · Answer 3 · answered Aug 17 '10 at 22:15

Well, here is a quick method I threw together once. It uses regular expressions to get the job done. Most people will agree that this is not a good way to go about it, so use it at your own risk.

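public static String getPlainText(String html) {
    String htmlBody = html.replaceAll("<hr>", ""); // one-off for horizontal rule lines
    // Replace each tag/text/tag run with just the text captured between the tags.
    String plainTextBody = htmlBody.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1");
    // Drop any remaining self-closing line breaks.
    plainTextBody = plainTextBody.replaceAll("<br ?/>", "");
    // decodeHtml is a helper that is not shown in this answer; presumably it decodes HTML entities.
    return decodeHtml(plainTextBody);
}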

This was originally used in my API wrapper for the Stack Overflow API, so it was only tested against a small subset of HTML tags.

jjnguy

Hmmm... why don't you use simple regexp: `replaceAll("<[^>]+>", "")`? – Crozin Aug 17 '10 at 22:28
@Crozin, well I was teaching myself how to use the back-references I guess. It looks like yours would probably work too. – jjnguy Aug 17 '10 at 22:31
this hurts! -> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – sleeplessnerd Aug 27 '11 at 13:54
@sleep, I'm well aware that parsing html with regex can be a terrible idea. But sometimes it is actually an OK choice. I mentioned that they should use it at their own risk. – jjnguy Aug 27 '11 at 17:17
@jjnguy: :) - just for the fun of it – sleeplessnerd Aug 27 '11 at 21:55
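
The simpler pattern suggested in the comments above, `replaceAll("<[^>]+>", "")`, can be tried in isolation. This is a hedged sketch only: the class name RegexTagStripper and the method stripTags are hypothetical, the whitespace collapsing is an added assumption, and HTML entities are not decoded. The second call shows one way the approach fails.

```java
// Hypothetical helper class illustrating the regex from the comments.
class RegexTagStripper {

    // Removes anything that looks like a tag, then collapses whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Simple markup works as expected.
        System.out.println(stripTags(
            "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"));
        // prints: An example link.

        // A '>' inside an attribute value ends the match too early.
        System.out.println(stripTags("<a title='a > b'>text</a>"));
        // prints: b'>text
    }
}
```

The failure case is why a real parser such as jsoup is usually the safer choice for arbitrary pages.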
