HTML Well-formedness parser

Question

Heyy guys, I need to determine if a given HTML Document is well formed or not.
I just need a simple implementation using only Java core API classes i.e. no third party stuff like JTIDY or something.

Actually, what is exactly needed is an algorithm that scans a list of TAGS. If it finds an open tag, and the next tag isn't its corresponding close tag, then it should be another open tag which in turn should have its close tag as the next tag, and if not it should be another open tag and then its corresponding close tag next, and the close tags of the previous open tags in reverse order coming one after the other on the list. If the list conforms to this order then it returns true or else false. I've already written methods to convert a tag to a close tag.

Here is the skeleton code of what I've started working on already. Its not too neat, but it should give you guys a basic idea of what I'm trying to do.

public boolean validateHtml(){

    ArrayList<String> tags = fetchTags();
    //fetchTags returns this [<html>, <head>, <title>, </title>, </head>, <body>, <h1>, </h1>, </body>, </html>]

    //I create another ArrayList to store tags that I haven't found its corresponding close tag yet
    ArrayList<String> unclosedTags = new ArrayList<String>();

    String temp;

    for (int i = 0; i < tags.size(); i++) {

        temp = tags.get(i);

        if(!tags.get(i+1).equals(TagOperations.convertToCloseTag(tags.get(i)))){
            unclosedTags.add(tags.get(i));
            if(){

            }

        }else{
            return true;//well formed html
        }
    }

    return true;
}

I strongly recommend using a 3rd-party library for this. Unless this is an academic exercise, it's just not worth the time. Edge cases? `
`, self-closing tags, different doctypes... — Matt Ball, Mar 01 '11 at 20:17
possible duplicate of [How to validate HTML from Java?](http://stackoverflow.com/questions/4392505/how-to-validate-html-from-java) — Matt Ball, Mar 01 '11 at 20:18
I see that you've put some effort already into your (homework)? Good work! But, we try not to just "send da codez". Is there a specific question we can help you with, rather than just completing your algorithm? — rlb.usa, Mar 01 '11 at 20:20
@Matt Ball : I don't think so. It really sounds like homework, the OP asks for _no third party_, and the solutions there are all 3rd-party. — rlb.usa, Mar 01 '11 at 20:21

Highstead · Accepted Answer · 2011-03-01T21:02:41.833

Two thoughts. First off maybe you could get away with using an XML parser on the html? Potentially easier and vastly less time consuming.

I havn't put a whole lot of thought into this but to me it sounds like recursion and stack would be the way to go. Something like

public myClass(String htmlInput)
{
    openedTags = new Stack<String>();
    this.htmlInput = htmlInput;
}
public boolean validate()
{
    return validate(this.htmlInput);
}
private boolean validate(String html)
{
    boolean result = true;
    String curTag;
    while(htmlLeft)        //worker loop
    {

        if(isOneOffTag(curTag))                 //matches <tags />
            continue;
        else if(isOpenTag(curTag))              //matches <tags>
        {
            openedTags.push(curTag);
            if(!validate(innerHtml))
                return false;
        }
        else if(isCloseTag(curTag))             //matches </tags>
        {
            String lastTag = (String)openedTags.peek();
            if(!tagIsSimiliar(curTag, lastTag))
                return false;
            openedTags.pop();
        }
    }


    return result;
}
private String nextTag(){return null;}
private boolean isOpenTag(String tag){ return true;}
private boolean isCloseTag(String tag){ return true;}
private boolean isOneOffTag(String tag){ return true;}
private boolean tagIsSimiliar(String curTag, String lastTag){return true;}

*edit 1: probably should have pushed onto the stack.

**edit 2: I suppose the issue here would be to determine where when returning solely a boolean you've left off. This would require some kind of pointer so that you know where you've left off. The idea though i believe would still work.

Just an FYI, the common term for "one-off tag" is "self-closing tag." — Matt Ball, Mar 01 '11 at 21:42
hmmn, this is an interesting algorithm you developed here and so quick, cool, I think I'm beginning to understand what you're trying to do here. but I don't quite understand what you mean by the terms 'htmlLeft' and 'innerHtml' please elaborate. — sacretruth, Mar 01 '11 at 22:19
Hmm yeah, I don't think i spent enough time on this. Html left is just the outer working loop where it keeps running so long as there is html left to parse. I'll leave this how up to you. innerHtml is a term i picked up from C# i believe which would refer to this is the innerHtml. The issue would be only passing the inner html which would be probably more effort then it's worth. If you passed erroneous innerhtml you would have to use an elaborate check system to compensate. Strip the recursion out and it should work fine. I just really liked the recursion idea. — Highstead, Mar 02 '11 at 14:23

HTML Well-formedness parser

1 Answers1