Your subject specifies "using regex," but that's probably a really bad approach. Even if you got something to work, it would probably end up being very fragile, meaning that seemingly insignificant (and perfectly legal, from an HTML point of view) changes to the input would cause your code to fail. And handling all the syntactic complexities of XML (and hence of HTML) could be a nightmare. For example, attribute values can be quoted with single or double quotes; character entities (like "&quot;") can appear in attribute values or element text; element text can appear in CDATA form; and so on.
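To make the fragility concrete, here is a small sketch. The pattern, the class name RegexFragility and the sample tags are all made up for illustration (they are not taken from your input); the point is only that markup which means exactly the same thing stops matching.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFragility {
    // Hypothetical pattern: assumes double quotes, "class" before "id",
    // and exactly one space between attributes.
    static final Pattern DIV =
            Pattern.compile("<div class=\"list\" id=\"([^\"]*)\">");

    // Returns the captured id, or null if the pattern does not match.
    static String extractId(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The exact form the pattern was written for: prints a1
        System.out.println(extractId("<div class=\"list\" id=\"a1\">"));
        // Same element with single-quoted attributes: prints null
        System.out.println(extractId("<div class='list' id='a1'>"));
        // Same element with the attributes reordered: prints null
        System.out.println(extractId("<div id=\"a1\" class=\"list\">"));
    }
}
```

You can keep patching the pattern for each new variation, but a real parser already handles all of them.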
A much more reliable approach is to use one of the XML parsing solutions available in the javax.xml package. You have several choices, and any of them can be used as the basis for a robust solution to your problem.
One simple approach is to use a combination of org.w3c.dom.Document and javax.xml.xpath.XPathExpression. With the former, your XML is parsed and you end up with its full contents in a navigable object of type Document. You could navigate that directly to find the data you're looking for, but you can also use XPathExpressions to do the searching for you.
This approach may not be practical if your input document can be very large. In that case you might look into the org.xml.sax package, which provides a streaming XML parser. You won't be able to use XPaths with that, but the handler you'd have to write should be quite easy for your problem.
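As a rough sketch of what such a handler could look like (not a drop-in solution): the names SaxSketch, ListHandler and extract, and the sample input in main, are invented for illustration. It assumes the data lives in div elements with class "list" that carry an id attribute and contain child divs with class "time" and "subject". If a list div has several time or subject children, this version keeps the last one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {
    /** Collects "id|time|subject" rows while the parser streams through the input. */
    static class ListHandler extends DefaultHandler {
        final List<String> rows = new ArrayList<>();
        private String id, time, subject;
        private String capturing;   // "time", "subject", or null when not capturing
        private int depth;          // div nesting inside the current list div; 0 = outside
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (!"div".equals(qName)) return;
            String cls = atts.getValue("class");
            if (depth == 0) {
                // Only a <div class="list"> opens a new record.
                if ("list".equals(cls)) {
                    depth = 1;
                    id = atts.getValue("id");
                    time = "";
                    subject = "";
                }
                return;
            }
            depth++;
            // Direct children of the list div are the ones we want.
            if (depth == 2 && ("time".equals(cls) || "subject".equals(cls))) {
                capturing = cls;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            if (capturing != null) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth == 0 || !"div".equals(qName)) return;
            if (depth == 2 && capturing != null) {
                if ("time".equals(capturing)) time = text.toString();
                else subject = text.toString();
                capturing = null;
            }
            depth--;
            if (depth == 0) rows.add(id + "|" + time + "|" + subject);
        }
    }

    // Hypothetical helper: parses a string; a real program would pass a file or stream.
    static List<String> extract(String xml) throws Exception {
        ListHandler handler = new ListHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body>"
                + "<div class='list' id='a1'>"
                + "<div class='time'>10:00</div>"
                + "<div class=\"subject\">Tom &amp; Jerry</div>"
                + "</div></body></html>";
        extract(xml).forEach(System.out::println);  // prints a1|10:00|Tom & Jerry
    }
}
```

Note that the parser takes care of the quoting styles and the entity for you. The handler only tracks where it is in the document, so memory use stays flat no matter how large the input is.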
Here's code using the Document / XPathExpression approach. If you save your HTML snippet (with the incorrect "<div/>" replaced with "</div>" in a few places, and wrapped in "<html><body>...</body></html>") in a file named "foo.html" alongside the Test.class file, you should be able to run it successfully.
    package test;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    import org.xml.sax.SAXException;

    import java.io.IOException;
    import java.io.InputStream;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathExpressionException;
    import javax.xml.xpath.XPathFactory;

    public class Test {
        public static void main(String[] argv)
                throws XPathExpressionException, SAXException, IOException, ParserConfigurationException {
            // Compile the XPath expressions once, up front.
            XPathFactory fac = XPathFactory.newInstance();
            XPathExpression idDivExpr = fac.newXPath().compile("//div[@class='list']");
            // These two are relative paths, evaluated against each matching list div.
            XPathExpression timeExpr = fac.newXPath().compile("div[@class='time']");
            XPathExpression subjExpr = fac.newXPath().compile("div[@class='subject']");

            // foo.html is looked up relative to Test.class on the classpath.
            try (InputStream in = Test.class.getResourceAsStream("foo.html")) {
                if (in == null) {
                    throw new IOException("foo.html not found alongside Test.class");
                }
                Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);

                // Print one "id|time|subject" line per <div class="list">.
                NodeList nl = (NodeList) idDivExpr.evaluate(doc, XPathConstants.NODESET);
                for (int i = 0; i < nl.getLength(); i++) {
                    Element elt = (Element) nl.item(i);
                    System.out.printf("%s|%s|%s%n",
                            elt.getAttribute("id"),
                            timeExpr.evaluate(elt),
                            subjExpr.evaluate(elt));
                }
            }
        }
    }