
Is there a Java API available from the Maven repository that can parse an HTML document and verify whether it is well-formed?

UPDATE:

The code in my program looks like this:

    String url = "C:/Users/user1/Desktop/testHTML.html";
    FileInputStream fi = new FileInputStream(url);

    Tidy tidy = new Tidy();
    //tidy.setQuiet(true);
    tidy.parse(fi, null);
    //tidy.parseDOM(fi, fo);
    int tempWarnings = tidy.getParseWarnings();
    int tempErrors = tidy.getParseErrors();

The contents of my HTML file are like this:

<html>
<head>
    <title>This is a sample doc</title>
</head>
<body>
    <p> <b>this is a sample paragraph</b></p>

However, Tidy doesn't give any warnings or errors even though the DOCTYPE and the closing tags are missing.

naren.katneni
  • http://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers – assylias May 21 '13 at 20:16

2 Answers

Yes, JTidy is available in Maven.

It is a good library for a few HTML-related tasks.

rolfl
  • I've used JTidy, but I couldn't find a method that just takes an HTML file as input and says whether it is well-formed. There are methods that try to clean up the file, though, which is not what I'm looking for. – naren.katneni May 21 '13 at 20:24

As @rolfl said, you can use JTidy for this. The JTidy documentation kind of sucks though (and I've never used it before) so I downloaded it and tried using it. This test runs and gives you 3 warnings:

package com.sandbox;

import org.junit.Test;
import org.w3c.tidy.Tidy;

import java.io.StringReader;
import java.io.StringWriter;

import static org.junit.Assert.assertEquals;

public class SandboxTest {

    @Test
    public void myTest() {
        Tidy tidy = new Tidy();
        StringWriter writer = new StringWriter();
        tidy.parse(new StringReader("invalid html"), writer);
        assertEquals(0, tidy.getParseErrors());
        assertEquals(0, tidy.getParseWarnings());
    }
}

The assertion on the last line fails because `getParseWarnings()` returns 3 instead of 0. Is that what you're looking for?


I tried using your input and I get a warning for it:

package com.sandbox;

import org.junit.Test;
import org.w3c.tidy.Tidy;

import java.io.StringReader;
import java.io.StringWriter;

import static org.junit.Assert.assertEquals;

public class SandboxTest {

    @Test
    public void myTest() {
        Tidy tidy = new Tidy();

        StringWriter writer = new StringWriter();
        tidy.parse(new StringReader("<html>\n" +
                "<head>\n" +
                "    <title>This is a sample doc</title>\n" +
                "</head>\n" +
                "<body>\n" +
                "    <p> <b>this is a sample paragraph</b></p>"), writer);
        assertEquals(0, tidy.getParseErrors());
        assertEquals(0, tidy.getParseWarnings());
    }
}

Output:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
InputStream: Document content looks like HTML 2.0
1 warning, no errors were found!

java.lang.AssertionError: 
Expected :0
Actual   :1

    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.junit.Assert.assertEquals(Assert.java:456)
    at com.sandbox.SandboxTest.myTest(SandboxTest.java:25)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:77)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:195)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:63)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)


Process finished with exit code -1
Daniel Kaplan
  • Thanks for the suggestion. This is what I was looking for. But I tried a few HTML files that are not well-formed (tag not closed, no DOCTYPE, etc.) and there are no warnings or errors for these. It does correctly report errors for some other things, though ( inside ). Is there some parameter that needs to be set to make JTidy stricter? – naren.katneni May 21 '13 at 20:57
  • @naren.katneni by default it's complaining about `line 1 column 1 - Warning: missing <!DOCTYPE> declaration` so I don't see how you're avoiding that. Show us your code please. – Daniel Kaplan May 21 '13 at 21:16
  • Here is the code: `String url = args[0]; url = "C:/Users/user1/Desktop/testHTML.html"; FileInputStream fi = new FileInputStream(url); Tidy tidy = new Tidy(); //tidy.setQuiet(true); tidy.parse(fi, null); //tidy.parseDOM(fi, fo); int tempWarnings = tidy.getParseWarnings(); int tempErrors = tidy.getParseErrors();` When I run the code, it prints: `Tidy (vers 4th August 2000) Parsing "InputStream" InputStream: Document content looks like HTML 2.0 no warnings or errors were found`. I don't have a DOCTYPE in the input HTML file. – naren.katneni May 21 '13 at 21:39
  • @naren.katneni It would be better if you edit your question to add this code to it. Especially important is the contents of the `testHTML.html` file. – Daniel Kaplan May 21 '13 at 21:41
  • I added the code to the question. – naren.katneni May 21 '13 at 21:56
  • @naren.katneni see my edit. Isn't this what you want? – Daniel Kaplan May 21 '13 at 22:07
  • Sorry for the late reply. I'm still not getting any warning even when the DOCTYPE is not specified, and I'm unable to pass a StringWriter and a StringReader as parameters to the parse() method either. I'm starting to wonder if you have a newer version of the JTidy jar. Can you please let me know which version you are using? I got my copy from http://search.maven.org/#artifactdetails%7Cjtidy%7Cjtidy%7C4aug2000r7-dev%7Cjar . Thanks. – naren.katneni May 22 '13 at 15:20
  • `<groupId>net.sf.jtidy</groupId> <artifactId>jtidy</artifactId> <version>r938</version>` – Daniel Kaplan May 22 '13 at 16:29
  • Thanks for the information, that helped. The DOCTYPE is being checked with this version of the jar file. But it still does not recognize unclosed tags ( tag is not closed in your code but no warning shows up). – naren.katneni May 22 '13 at 18:10
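
On the unclosed-tag point raised in the last comment: HTML itself allows the closing `</body>` and `</html>` tags to be omitted, which may be why no warning shows up for the sample. If "well-formed" is meant in the strict XML sense (every tag closed and properly nested), one possible alternative is to run the markup through the JDK's built-in XML parser instead of JTidy. This is only a sketch of that swapped-in technique, not something from the answers above; the class name `WellFormedCheck` and the method `isWellFormed` are made up for illustration, and the check will also reject valid HTML that is not XML (void tags such as `<br>`, unquoted attributes, entities such as `&nbsp;`).

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: checks XML-style well-formedness, not HTML validity.
final class WellFormedCheck {

    private WellFormedCheck() {}

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve any external DTD to empty content so nothing is fetched over the network.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            // Rethrow parse problems instead of letting the default handler print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Any well-formedness violation (e.g. an unclosed tag) ends up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser could not be set up or could not read the input", e);
        }
    }

    public static void main(String[] args) {
        // The sample from the question: <body> and <html> are never closed.
        String unclosed = "<html>\n<head>\n    <title>This is a sample doc</title>\n</head>\n"
                + "<body>\n    <p> <b>this is a sample paragraph</b></p>";
        String closed = unclosed + "\n</body>\n</html>";

        System.out.println(isWellFormed(unclosed)); // false
        System.out.println(isWellFormed(closed));   // true
    }
}
```

Whether this is stricter than you want depends on whether your input is supposed to be XHTML-like; for ordinary HTML, the JTidy warning and error counts are probably the better signal.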