I'm seeing odd behavior with javax.swing.text.ElementIterator. It never returns all elements, and the number of elements it returns depends on which type of ParserCallback I use. The test below uses the website linked in my profile, but any other large HTML file reproduces it.

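// Some imports shown in case it's an import mix-up
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Enumeration;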
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Shows what's in an element, recursively
public void printElement(HTMLDocument htmlDoc, Element element)
        throws BadLocationException
{
    AttributeSet attributes = element.getAttributes();
    System.out.println("element: '" + element.toString().trim() + "', name: '" + element.getName() + "', children: " + element.getElementCount() + ", attributes: " + attributes.getAttributeCount() + ", leaf: " + element.isLeaf());
    Enumeration attrEnum = attributes.getAttributeNames();
    while (attrEnum.hasMoreElements())
    {
        Object attr = attrEnum.nextElement();
        System.out.println("\tAttribute: '" + attr + "', Val: '" + attributes.getAttribute(attr) + "'");
        if (attr == StyleConstants.NameAttribute
                && attributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
        {
            int startOffset = element.getStartOffset();
            int endOffset = element.getEndOffset();
            int length = endOffset - startOffset;
            System.out.printf("\t\tContent (%d-%d): '%s'\n", startOffset, endOffset, htmlDoc.getText(startOffset, length).trim());
        }
    }
    for (int i = 0; i < element.getElementCount(); i++)
    {
        Element child = element.getElement(i);
        printElement(htmlDoc, child);
    }
}

public void tryParse(String filename) 
        throws FileNotFoundException, IOException, BadLocationException
{
    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filename)));

    Parser parser = new ParserDelegator();
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
    ParserCallback callback2 = htmlDoc.getReader(0);
    ParserCallback callback1 =
            new HTMLEditorKit.ParserCallback()
            {
            };

    parser.parse(in, callback2, true);
    ElementIterator iterator = new ElementIterator(htmlDoc);
    Element element;
    while ((element = iterator.next()) != null)
        printElement(htmlDoc, element);
    in.close();
}

In the test above, the results differ depending on whether I use callback1 or callback2. Even weirder, if I override the callback methods and have them print something, they show that the parser does process the whole website, but the ElementIterator still doesn't contain all of it.

I've also tried to use htmlKit.read() instead of parser.parse(), but it still doesn't work.

Although I'm now getting the results I want from the parser callback methods (not shown here), I'd still like to know why ElementIterator doesn't work as expected, in case I need it later. Does anyone here have experience with ElementIterator?

Update: Complete Java Source uploaded here: http://home.snafu.de/tilman/tmp/Main.java

Tilman Hausherr

1 Answer


Using the approach seen here, I haven't noticed the problem you describe. I added a println(), and all the elements seem to be there.

Addendum: I'm not sure how your tryParse() fails, but your printElement() seems to work from my main():

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Enumeration;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

/** @see https://stackoverflow.com/questions/2882782 */
public class NewMain {

    public static void main(String args[]) throws Exception {
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlKit.read(new BufferedReader(new FileReader("test.html")), htmlDoc, 0);
        ElementIterator iterator = new ElementIterator(htmlDoc);
        Element element;
        while ((element = iterator.next()) != null) {
            printElement(htmlDoc, element);
        }
    }
    private static void printElement(HTMLDocument htmlDoc, Element element)
        throws BadLocationException {
        AttributeSet attrSet = element.getAttributes();
        System.out.println(""
            + "Element: '" + element.toString().trim()
            + "', name: '" + element.getName()
            + "', children: " + element.getElementCount()
            + ", attributes: " + attrSet.getAttributeCount()
            + ", leaf: " + element.isLeaf());
        Enumeration attrNames = attrSet.getAttributeNames();
        while (attrNames.hasMoreElements()) {
            Object attr = attrNames.nextElement();
            System.out.println("  Attribute: '" + attr + "', Value: '"
                + attrSet.getAttribute(attr) + "'");
            Object tag = attrSet.getAttribute(StyleConstants.NameAttribute);
            if (attr == StyleConstants.NameAttribute
                && tag == HTML.Tag.CONTENT) {
                int startOffset = element.getStartOffset();
                int endOffset = element.getEndOffset();
                int length = endOffset - startOffset;
                System.out.printf("    Content (%d-%d): '%s'\n", startOffset,
                    endOffset, htmlDoc.getText(startOffset, length).trim());
            }
        }
    }
}
trashgod
  • I'm having trouble understanding your example, and I don't see an easy way to test it. It might help to prepare an [sscce](http://sscce.org/) that reproduces the problem; this [example](http://stackoverflow.com/questions/2883413) might be a suitable starting point. – trashgod Apr 10 '11 at 20:10
  • I've uploaded an sscce (see edit). The example you mention isn't really different except for htmlKit.read(), which I did try before writing. I know that my example outputs "a lot", but not everything. Look for the word "bored", which is in the html file but not in the program output. println(element) does not output the text between html tags (that text belongs to the 'LeafElement(content)' elements), which is why I wrote my own printElement() function, which does. – Tilman Hausherr Apr 10 '11 at 21:05
  • Although I don't have `xenulink.html`, I see what you mean. Using `read()` seems to work, as shown above. – trashgod Apr 10 '11 at 21:44
  • The file is linked from my profile. I didn't want to turn this into an advertisement for my website. But it happens with other websites as well, at least when they have more than a certain size. – Tilman Hausherr Apr 10 '11 at 22:04
  • Yes, `read()` on your html file throws `ChangedCharSetException`, but any _other_ approach seems to miss content. – trashgod Apr 11 '11 at 02:25
  • Oops, you are correct, read() did work. I tried so many alternatives yesterday that I misremembered what worked and what didn't. This morning I reactivated that part of my code and yes, that worked. The exception is avoided by calling htmlDoc.putProperty("IgnoreCharsetDirective", Boolean.TRUE) before reading. But I'd still like to find out why parser.parse() doesn't do the same thing, because the callback functions are interesting too. – Tilman Hausherr Apr 11 '11 at 06:12
  • Sadly, I recall it not working and proposing the code cited above. I'd forgotten about the property. Thanks! – trashgod Apr 11 '11 at 06:45
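
The two loading paths discussed in the comments above can be sketched side by side. This is a minimal illustration, not the asker's or answerer's code: the class name CharsetDirectiveDemo, the helper names textViaRead and textViaParser, and the inline sample HTML are all made up for the example. The first helper sets the "IgnoreCharsetDirective" property before read(), as described in the comments. The second drives the parser directly and then calls flush() on the reader returned by getReader(0), which is what HTMLEditorKit.read() does after parsing. Whether the missing flush() fully explains the lost content in the question is an assumption; the thread does not confirm it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class CharsetDirectiveDemo {

    // Hypothetical helper: load via HTMLEditorKit.read() with the charset
    // directive ignored, so a <meta ... charset=...> tag does not trigger
    // ChangedCharSetException.
    static String textViaRead(String html)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);
        return doc.getText(0, doc.getLength());
    }

    // Hypothetical helper: drive the parser directly with the document's
    // own reader, then flush so pending changes are pushed into the document.
    static String textViaParser(String html)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        HTMLEditorKit.ParserCallback reader = doc.getReader(0);
        // Third argument true = ignore charset directives, as in the question.
        new ParserDelegator().parse(new StringReader(html), reader, true);
        reader.flush(); // assumption: this is the step missing in tryParse()
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input; substitute the contents of a real file.
        String html = "<html><head>"
            + "<meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=utf-8\">"
            + "</head><body><p>I am bored</p></body></html>";
        System.out.println("read():  " + textViaRead(html).contains("bored"));
        System.out.println("parse(): " + textViaParser(html).contains("bored"));
    }
}
```

Either helper could be swapped into tryParse() in place of the bare parser.parse() call; the ElementIterator loop can then be run over the resulting document unchanged.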