Match specific html attribute values

Question

I would like to match all attribute values for id, class, name and for. I wrote a simple method for that task:

private Collection<String> getAttributes(final String htmlContent) {
    final Set<String> attributes = new HashSet<>();
    final Pattern pattern = Pattern.compile("(class|id|for|name)=\\\"(.*?)\\\"");
    final Matcher matcher = pattern.matcher(htmlContent);
    while (matcher.find()) {
        attributes.add(matcher.group(2));
    }
    return attributes;
}

Example html content:

<input id="test" name="testName" class="aClass bClass" type="input" />

How can I also split the HTML class values with a regular expression, so that I get the following result set:

test
testName
aClass
bClass

And is there any way to improve my code? I really don't like the loop.

SME_Dev · Answer 1 · 2016-02-18T16:57:54.373

Take a look at the JSoup library; it provides useful tools for HTML parsing and manipulation.

For example:

Document doc = Jsoup.parse(htmlContent); // create the HTML document
Elements htmlElements = doc.children();
htmlElements.traverse(new MyHtmlElementVisitor());

The class MyHtmlElementVisitor only has to implement JSoup's NodeVisitor interface; its callbacks receive each Node, whose attributes it can then read.

You might find a regex that does the same job, but that approach has several drawbacks. To name a few:

- It is hard to write a failsafe regex that handles every possible HTML document.
- It is hard to read, which makes bugs difficult to find and changes difficult to implement.
- The regex usually isn't reusable.
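
To make the first drawback concrete, here is a small sketch that runs the pattern from the question (written without the redundant escaping) against three inputs it handles badly. The class name RegexPitfalls and the sample inputs are made up for illustration; only the pattern comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPitfalls {
    // Same pattern as in the question; a plain \" in the Java string is enough.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("(class|id|for|name)=\"(.*?)\"");

    // Returns the raw attribute values the pattern finds, in document order.
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = ATTRIBUTE.matcher(html);
        while (matcher.find()) {
            values.add(matcher.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        // Single-quoted values are valid HTML, but the pattern misses them.
        System.out.println(extract("<p class='intro'>text</p>"));   // prints []
        // "data-id" ends in "id", so its value is picked up by mistake.
        System.out.println(extract("<p data-id=\"42\">text</p>"));  // prints [42]
        // Markup inside a comment is matched as if it were live markup.
        System.out.println(extract("<!-- <p id=\"old\"> -->"));     // prints [old]
    }
}
```

Each of these cases can be patched with a longer pattern, but every patch makes the expression harder to read, which is the second drawback.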

score 0 · Accepted Answer · answered Feb 18 '16 at 17:37

Don't use regular expressions for parsing HTML. Seriously, it's more complicated than you think.

If your document is actually XHTML, you can use XPath:

XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xpath.evaluate(
    "//@*["
        + "local-name()='class'"
        + " or local-name()='id'"
        + " or local-name()='for'"
        + " or local-name()='name'"
    + "]",
    new InputSource(new StringReader(htmlContent)),
    XPathConstants.NODESET);
int count = nodes.getLength();
for (int i = 0; i < count; i++) {
    Collections.addAll(attributes,
        nodes.item(i).getNodeValue().split("\\s+"));
}
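
For reference, here is a self-contained sketch of the XPath approach applied to the example input from the question, which happens to be well-formed XML. The class name XPathAttributeDemo and the method name attributeTokens are placeholders of mine; the sorted set is only there to make the printed order predictable.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathAttributeDemo {
    // Collects the whitespace-separated tokens of all class, id, for and name attributes.
    static Set<String> attributeTokens(String xhtml) {
        Set<String> attributes = new TreeSet<>();
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                "//@*[local-name()='class' or local-name()='id'"
                    + " or local-name()='for' or local-name()='name']",
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                // Trim first so leading whitespace does not yield an empty token.
                String value = nodes.item(i).getNodeValue().trim();
                if (!value.isEmpty()) {
                    Collections.addAll(attributes, value.split("\\s+"));
                }
            }
        } catch (XPathExpressionException e) {
            // Also raised when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
        return attributes;
    }

    public static void main(String[] args) {
        String xhtml = "<input id=\"test\" name=\"testName\""
            + " class=\"aClass bClass\" type=\"input\" />";
        System.out.println(attributeTokens(xhtml)); // prints [aClass, bClass, test, testName]
    }
}
```

Input that is not well-formed XML (an unclosed input tag, for example) ends up in the catch block, which is why this route only suits real XHTML.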

If it's not XHTML, you can use Swing's HTML parsing:

HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
    private final Object[] attributesOfInterest = {
        HTML.Attribute.CLASS,
        HTML.Attribute.ID,
        "for",
        HTML.Attribute.NAME,
    };

    private void addAttributes(AttributeSet attr) {
        for (Object a : attributesOfInterest) {
            Object value = attr.getAttribute(a);
            if (value != null) {
                Collections.addAll(attributes,
                    value.toString().split("\\s+"));
            }
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag,
                               MutableAttributeSet attr,
                               int pos) {
        addAttributes(attr);
        super.handleStartTag(tag, attr, pos);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag,
                                MutableAttributeSet attr,
                                int pos) {
        addAttributes(attr);
        super.handleSimpleTag(tag, attr, pos);
    }
};

HTMLDocument doc = (HTMLDocument)
    new HTMLEditorKit().createDefaultDocument();
doc.getParser().parse(new StringReader(htmlContent), callback, true);

As for doing it without a loop, I don't think that's possible. In any case, every implementation is going to use one or more loops internally.
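
If the goal is mainly to avoid writing the loop yourself, a stream pipeline can hide it. The following is only a sketch, assuming the raw attribute values have already been collected by one of the parsers above; the names SplitWithoutExplicitLoop and splitValues are made up. The stream still iterates over every value internally.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class SplitWithoutExplicitLoop {
    // Splits each raw attribute value on whitespace and collects the distinct tokens.
    static Set<String> splitValues(Collection<String> rawValues) {
        return rawValues.stream()
            .map(String::trim)
            .filter(value -> !value.isEmpty())
            .flatMap(value -> Arrays.stream(value.split("\\s+")))
            .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<String> rawValues = Arrays.asList("test", "testName", "aClass bClass");
        System.out.println(splitValues(rawValues)); // prints [aClass, bClass, test, testName]
    }
}
```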
