Matching repeating HTML pattern using Java regex

Question

Someone may have asked this question before, but I couldn't find a solution, so I'm posting it.

I need to parse the below HTML string to find id, time and subject for each item:

<div class="list" id="1">
  <div class="time">12:01 PM</div>
  <div class="subject">[This is dummy Subject1] This is some dummy strings after subject</div>
<div/>
<div class="list" id="2">
  <div class="time">12:01 PM</div>
  <div class="subject">[This is dummy Subject2] This is some dummy strings after subject</div>
<div/>
<div class="list" id="3">
  <div class="time">12:01 PM</div>
  <div class="subject">[This is dummy Subject3] This is some dummy strings after subject</div>
<div/>

The output needs to be like: id|time|subject.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454 — ug_, Mar 24 '15 at 11:17
What's the format of ID? A number? What's the format of Time? What's the format of Subject? Any string? If you don't have a delimiter to separate id|time|subject from the rest, this will be very complicated. — Alexander, Mar 24 '15 at 11:30
@Alexander: ID: ID-{numeric}, Time: 12:01 PM/AM, Subject: Any character inside []. For ID div the class always will be list, for time the class will be "time" and for subject the class will be subject. — Roul, Mar 24 '15 at 11:31
Couldn't you use one of the many DOM parsers to get all lists from your document and parse the values afterwards using appropriate means? — Rangad, Mar 24 '15 at 11:34
@Rangad: I need to do it using regex. DOM parser not allowed :( . — Roul, Mar 24 '15 at 11:36

score 1 · Answer 1 · answered Mar 24 '15 at 13:46

See a demo here: https://regex101.com/r/fN1fZ0/1

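// JavaScript, as used in the demo; `str` is assumed to hold the HTML input.
// Groups: 1 = id, 2 = time, 3 = text inside [ ]. The `s` flag lets `.` match newlines.
var re = /.*?id="(.*?)".*?time">(.*?)<\/.*?subject">\[(.*?)\].*?|.*$/gs;
var subst = '$1|$2|$3\n';

var result = str.replace(re, subst);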
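
Since the question asks for Java, here is a minimal sketch of the same idea with java.util.regex. The class name RegexExtract and the method extract are made up for illustration, and the pattern is a simplified variant of the one above. It assumes the markup looks exactly like the sample in the question: double-quoted attributes, and time before subject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {

    // Lazily skip from one marker to the next and capture the id, the time,
    // and the text inside [ ]. DOTALL lets '.' match line breaks.
    private static final Pattern ITEM = Pattern.compile(
        "id=\"(.*?)\".*?class=\"time\">(.*?)</div>.*?class=\"subject\">\\[(.*?)\\]",
        Pattern.DOTALL);

    /** Returns one "id|time|subject" string per matched item (hypothetical helper). */
    public static List<String> extract(String html) {
        List<String> rows = new ArrayList<>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            rows.add(m.group(1) + "|" + m.group(2) + "|" + m.group(3));
        }
        return rows;
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"list\" id=\"1\">\n"
          + "  <div class=\"time\">12:01 PM</div>\n"
          + "  <div class=\"subject\">[This is dummy Subject1] This is some dummy strings after subject</div>\n"
          + "<div/>\n"
          + "<div class=\"list\" id=\"2\">\n"
          + "  <div class=\"time\">12:01 PM</div>\n"
          + "  <div class=\"subject\">[This is dummy Subject2] This is some dummy strings after subject</div>\n"
          + "<div/>\n";
        // Prints:
        // 1|12:01 PM|This is dummy Subject1
        // 2|12:01 PM|This is dummy Subject2
        extract(html).forEach(System.out::println);
    }
}
```

Using Matcher.find() in a loop, rather than a replace as in the JavaScript version, avoids having to consume the text between items with an extra alternative.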

Vladu Ionut


score 0 · Accepted Answer · answered Mar 24 '15 at 17:17

Your title specifies "using regex," but that is probably a bad approach. Even if you got something to work, it would likely be fragile: seemingly insignificant changes to the input (perfectly legal from an HTML point of view) would cause your code to fail. And handling all the syntactic complexities of XML (and hence of HTML) could be a nightmare. For example, attribute values can be quoted with single or double quotes; character entities (like "&quot;") can appear in attribute values or element text; element text can appear in CDATA form; and so on.
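
To make the fragility concrete, here is a small sketch. The pattern and the class name FragilityDemo are hypothetical, chosen only to illustrate the point: a pattern written for double-quoted attributes silently misses two equally legal spellings of the same tag.

```java
import java.util.regex.Pattern;

public class FragilityDemo {

    // Hypothetical pattern that assumes double quotes and no space around '='.
    private static final Pattern ID = Pattern.compile("id=\"(.*?)\"");

    public static boolean findsId(String html) {
        return ID.matcher(html).find();
    }

    public static void main(String[] args) {
        // Double quotes, as in the question's sample: prints true
        System.out.println(findsId("<div class=\"list\" id=\"1\">"));
        // Single quotes are equally valid: prints false
        System.out.println(findsId("<div class='list' id='1'>"));
        // Whitespace around '=' is also legal: prints false
        System.out.println(findsId("<div id = \"1\" class=\"list\">"));
    }
}
```

A parser treats all three inputs the same way; a regex has to be extended for every such variation.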

A much more reliable approach is to use one of the XML parsing solutions available in the javax.xml package. You have several choices, and any of them can be used as the basis for a robust solution to your problem.

One simple approach is to use a combination of org.w3c.dom.Document and javax.xml.xpath.XPathExpression. With the former, your XML is parsed and you end up with its full contents in a navigable object of type Document. You could navigate that directly to find the data you're looking for, but you can also use XPathExpression objects to do the searching for you.

This approach may not be practical if your input document can be very large. In that case you might look into org.xml.sax package, which provides a streaming XML parser. You won't be able to use XPaths with that, but the handler you'd have to write should be quite easy for your problem.
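
For what it's worth, a SAX handler for this problem could look roughly like the sketch below. The names SaxExtract, ListHandler and extract are made up for illustration. It assumes the input is well-formed XML (the stray "<div/>" tags fixed, as for the DOM version) and that the time and subject divs contain only text, with no nested elements. Like the DOM code, it reports the full text of the subject div.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExtract {

    /** Collects one "id|time|subject" row per element with class="list". */
    static class ListHandler extends DefaultHandler {
        final List<String> rows = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String id, time, subject;
        private String capturing;   // "time", "subject", or null when not capturing
        private int depth;          // current element nesting depth
        private int listDepth = -1; // depth of the open list element, -1 if none

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            depth++;
            String cls = atts.getValue("class");
            if ("list".equals(cls)) {
                id = atts.getValue("id");
                time = "";
                subject = "";
                listDepth = depth;
            } else if ("time".equals(cls) || "subject".equals(cls)) {
                capturing = cls;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so accumulate.
            if (capturing != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (capturing != null) {
                // Assumes no nested elements inside the time/subject divs.
                if (capturing.equals("time")) {
                    time = text.toString();
                } else {
                    subject = text.toString();
                }
                capturing = null;
            } else if (depth == listDepth) {
                rows.add(id + "|" + time + "|" + subject);
                listDepth = -1;
            }
            depth--;
        }
    }

    /** Parses the given XML string and returns the collected rows (hypothetical helper). */
    public static List<String> extract(String xml) {
        ListHandler handler = new ListHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.rows;
    }

    public static void main(String[] args) {
        String xml = "<html><body>"
            + "<div class=\"list\" id=\"1\">"
            + "<div class=\"time\">12:01 PM</div>"
            + "<div class=\"subject\">[This is dummy Subject1] This is some dummy strings after subject</div>"
            + "</div>"
            + "</body></html>";
        extract(xml).forEach(System.out::println);
    }
}
```

For a genuinely large file you would pass a file or stream to parse() instead of a string, and print each row in endElement rather than collecting them in a list.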

Here's code using the Document / XPathExpression approach. If you save your HTML snippet (with incorrect "<div/>" replaced with "</div>" in a few places and wrapped in "<html><body>...</body></html>") in a file named "foo.html" alongside the Test.class file, you should be able to run it successfully.

package test;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;



public class Test {

  public static void main(String[] argv) throws XPathExpressionException, SAXException, IOException, ParserConfigurationException {
    XPathFactory fac = XPathFactory.newInstance();
    // Absolute expression: every <div class="list"> anywhere in the document.
    XPathExpression idDivExpr = fac.newXPath().compile("//div[@class='list']");
    // Relative expressions: evaluated against each list <div> found above.
    XPathExpression timeExpr = fac.newXPath().compile("div[@class='time']");
    XPathExpression subjExpr = fac.newXPath().compile("div[@class='subject']");
    // foo.html is loaded from the classpath, next to Test.class.
    InputStream in = Test.class.getResourceAsStream("foo.html");
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
    NodeList nl = (NodeList) idDivExpr.evaluate(doc, XPathConstants.NODESET);
    for (int i = 0; i < nl.getLength(); i++) {
      Element elt = (Element) nl.item(i);
      // evaluate(elt) returns the text of the first matching child <div>.
      System.out.printf("%s|%s|%s%n",
          elt.getAttribute("id"),
          timeExpr.evaluate(elt),
          subjExpr.evaluate(elt));
    }
  }
}
