Java HTML Parsing

Question

I'm working on an app that scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for

div class = "classname"

in each line of HTML. This works, but I can't help feeling there is a better solution out there.

Is there a nice way to hand a line of HTML to a class and get some convenient methods like:

boolean usesClass(String CSSClassname);
String getText();
String getLink();

Related: [What are the pros and cons of the leading Java HTML parsers?](http://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers) — BalusC, Aug 21 '11 at 14:06

score 60 · Answer 1 · edited Dec 24 '13 at 09:40

Another library that might be useful for HTML processing is jsoup. jsoup tries to clean malformed HTML and allows HTML parsing in Java using a jQuery-like selector syntax.

http://jsoup.org/
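
As a rough sketch of how the methods asked for in the question might map onto jsoup: the example below assumes jsoup is on the classpath, and the class `DivScraper`, the helper `describeDivs`, and the sample markup are made up for illustration rather than taken from jsoup itself.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper (not part of jsoup): collects the text and the first
// link of every <div> that carries the given CSS class.
public class DivScraper
{
    public static List<String> describeDivs(String html, String cssClass)
    {
        List<String> result = new ArrayList<String>();
        Document doc = Jsoup.parse(html);

        // "div.name" is a CSS selector: <div> elements having that class.
        for (Element div : doc.select("div." + cssClass))
        {
            // First <a href=...> inside the div, or null if there is none.
            Element link = div.select("a[href]").first();
            String href = (link == null) ? "" : link.attr("href");
            result.add(div.text() + " -> " + href);
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Made-up sample markup; the last <div> is deliberately left unclosed.
        String html = "<div class=\"item featured\"><a href=\"/a\">First</a></div>"
                    + "<div class=\"other\">Skipped</div>"
                    + "<div class=\"item\">Second";
        for (String line : describeDivs(html, "item"))
        {
            System.out.println(line);
        }
    }
}
```

In terms of the question, `Element.hasClass(...)`, `Element.text()` and `attr("href")` on a selected link play roughly the roles of `usesClass`, `getText` and `getLink`.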

answered May 18 '11 at 09:33 by rajsite · edited Dec 24 '13 at 09:40 by informatik01

Is there any method without going for an external jar? – Futuregeek Jul 02 '16 at 08:36

@Futuregeek I used to use regex, until I read [this answer](https://stackoverflow.com/a/1732454/5484609) – Kartik Chugh Jul 12 '17 at 16:17

score 20 · Answer 2 · answered Oct 26 '08 at 14:55

The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you have XML (XHTML), there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original markup.

Here is some sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all the text content inside them.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

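    // Returns every <div> whose class attribute contains the given CSS class.
    List<TagNode> getDivsByClass(String CSSClassname)
    {
        List<TagNode> divList = new ArrayList<TagNode>();

        TagNode[] divElements = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            if (classType == null)
            {
                continue;
            }
            // The class attribute may list several space-separated classes.
            for (String name : classType.trim().split("\\s+"))
            {
                if (name.equals(CSSClassname))
                {
                    divList.add(divElements[i]);
                    break;
                }
            }
        }

        return divList;
    }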

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List<TagNode> divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
            for (TagNode divElement : divs)
            {
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}
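
Once the markup has been cleaned into well-formed XHTML, the tree-based route mentioned in this answer needs nothing beyond the JDK. The following is only a sketch under some assumptions: the input is already well-formed, has no DOCTYPE or XML namespace declaration, and the class name passed in contains no quote characters. The class `XhtmlDivText` and the sample string are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlDivText
{
    // Returns the trimmed text content of each <div> that uses cssClass.
    public static List<String> textOfDivs(String xhtml, String cssClass)
    {
        try
        {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Match cssClass as a whole word in a space-separated class list.
            String expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' "
                        + cssClass + " ')]";
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList divs = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<String>();
            for (int i = 0; i < divs.getLength(); i++)
            {
                texts.add(divs.item(i).getTextContent().trim());
            }
            return texts;
        }
        catch (Exception e)
        {
            // Malformed input ends up here; real HTML must be cleaned first.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args)
    {
        String xhtml = "<html><body>"
                     + "<div class=\"tags main\">java <b>html</b></div>"
                     + "<div class=\"footer\">ignored</div>"
                     + "<div class=\"tags\">parsing</div>"
                     + "</body></html>";
        for (String text : textOfDivs(xhtml, "tags"))
        {
            System.out.println(text);
        }
    }
}
```

The XPath expression is a common idiom for matching one class out of several; a plain `@class='tags'` comparison would miss the first div in the sample.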

score 18 · Accepted Answer · answered Oct 26 '08 at 16:06

Several years ago I used JTidy for the same purpose:

http://jtidy.sourceforge.net/

"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.

More information on JTidy can be found on the JTidy SourceForge project page."

JTidy seems to be an abandoned project; it hasn't been updated for a few years now. – rlegendi Jul 10 '13 at 11:36

score 13 · Answer 4 · answered Oct 26 '08 at 14:16

You might be interested in TagSoup, a Java HTML parser that is able to handle malformed HTML. XML parsers would work only on well-formed XHTML.

answered Oct 26 '08 at 14:16 by PhiLho

score 5 · Answer 5 · answered Oct 26 '08 at 14:23

The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:

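import org.htmlparser.Parser;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.util.NodeList;

Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = 
    new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);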

score 5 · Answer 6 · answered Jan 21 '11 at 18:36

Jericho: http://jericho.htmlparser.net/docs/index.html

It is easy to use, handles HTML that is not well-formed, and comes with a lot of examples.

answered Jan 21 '11 at 18:36 by FolksLord

score 4 · Answer 7 · answered Oct 26 '08 at 19:16

HTMLUnit might be of help. It does a lot more stuff too.

http://htmlunit.sourceforge.net/

answered Oct 26 '08 at 19:16 by alex

score 4 · Answer 8 · edited Oct 03 '14 at 15:20

Let's not forget Jerry. It's jQuery in Java: a fast and concise Java library that simplifies HTML document parsing, traversing and manipulating, and it supports CSS3 selectors.

Example:

Jerry doc = jerry(html);
doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");

Example:

doc.form("#myform", new JerryFormHandler() {
    public void onForm(Jerry form, Map<String, String[]> parameters) {
        // process form and parameters
    }
});

Of course, these are just some quick examples to give a feel for how it all looks.

answered Jan 08 '12 at 17:37 by igr · edited Oct 03 '14 at 15:20

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Engineer2021 May 30 '14 at 11:44

Thanks, added an example. I haven't seen examples on most of the other comments, so I followed the same pattern. – igr May 30 '14 at 13:05
No problem. It showed up in the Low Quality Queue. My comment is automated by SO. – Engineer2021 May 30 '14 at 19:04

score 3 · Answer 9 · answered Aug 19 '11 at 00:13

The nu.validator project is an excellent, high-performance HTML parser that doesn't cut corners on correctness.

The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)

score 1 · Answer 10 · edited Nov 10 '12 at 08:45

You can also use XWiki HTML Cleaner. It uses HtmlCleaner and extends it to generate valid XHTML 1.1 content.

answered Oct 04 '11 at 15:54 by Vincent Massol · edited Nov 10 '12 at 08:45 by Yaroslav

score 0 · Answer 11 · edited Jan 10 '13 at 09:27

If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.
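
For input that really is well-formed XHTML, a read-only SAX handler along those lines might look like the sketch below. It uses only the JDK; the class `SaxDivText`, its method names, and the sample markup are illustrative assumptions, and any text in nested elements is simply folded into the enclosing matching div.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDivText extends DefaultHandler
{
    private final String cssClass;
    private final List<String> texts = new ArrayList<String>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // greater than 0 while inside a matching <div>

    SaxDivText(String cssClass)
    {
        this.cssClass = cssClass;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
    {
        if (depth > 0)
        {
            depth++; // nested element inside a matching div
        }
        else if ("div".equals(qName) && hasClass(atts.getValue("class")))
        {
            depth = 1;
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
    {
        if (depth > 0)
        {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
    {
        // Leaving the outermost matching div: store the collected text.
        if (depth > 0 && --depth == 0)
        {
            texts.add(current.toString().trim());
        }
    }

    private boolean hasClass(String attr)
    {
        return attr != null
            && Arrays.asList(attr.trim().split("\\s+")).contains(cssClass);
    }

    public static List<String> textOfDivs(String xhtml, String cssClass)
    {
        SaxDivText handler = new SaxDivText(cssClass);
        try
        {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        }
        catch (Exception e)
        {
            // Typical real-world HTML fails here because it is not well-formed.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return handler.texts;
    }

    public static void main(String[] args)
    {
        String xhtml = "<html><body><div class=\"a b\">x<i>y</i></div>"
                     + "<div class=\"c\">z</div></body></html>";
        System.out.println(textOfDivs(xhtml, "a"));
    }
}
```

Feeding this parser ordinary HTML with unclosed tags makes it throw, which is why a cleaning step is usually needed first.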

answered Oct 26 '08 at 14:01 by Yuval · edited Jan 10 '13 at 09:27 by musiKk

"If your HTML is well-formed." Is it ever? – PlexQ Apr 22 '12 at 07:03
Because I work on projects with other people, some of whom are designers who don't create perfect HTML, and many others don't either, doubly so when templating is used. – PlexQ Apr 23 '12 at 21:02

This is plain wrong, only XHTML can be parsed with an XML parser. – musiKk Jan 10 '13 at 09:27
