Java regular expression transform HTML list to text

Question

I have data in the form:

<ol>
<li>example1</li>
<li>example2</li>
<li>example3</li>
</ol>

which needs to turn into

# example1
# example2
# example3

The pound sign has to be associated with the ol html tag. I'm using java regular expressions and this is what I have so far:

info = info.replaceAll("(?s).<ol>\n(<li>(.*?)</li>\n)*</ol>","# $2");

info is a String containing the data. Also, there may be line breaks between the li tags. When I run it, it only prints the last item, i.e. the result is

 # example3

example2 and example1 are missing

Any thoughts on what I'm doing wrong?

[Don't use RegEx to parse XML/HTML tags...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Nightfirecat, Jul 13 '11 at 22:49
There are a number of examples of related questions here on SO. — Clockwork-Muse, Jul 13 '11 at 22:53
@Nightfirecat - despite the usual protestations on SO about html and regex's in this instance it would seem perfectly reasonable. — Joel, Jul 13 '11 at 23:06
Well, assuming it's all properly formatted, it's entirely possible (and even somewhat reasonable), but it's still not what RegEx is meant to do. — Nightfirecat, Jul 13 '11 at 23:08
Obviously, you need to make a practical decision between the required level of robustness vs readability vs simplicity of code etc. I like my version: it's a few lines of readable code. It won't work in a few corner cases. The XPath infrastructure is cumbersome for this simple requirement, and also liable to fall flat on its arse if the XML isn't well-formed. — Neil Coffey, Jul 13 '11 at 23:41

score 1 · Accepted Answer · answered Jul 13 '11 at 22:57

Your regex has a couple of problems:

- it contains a capturing group inside a capturing group
- overall, it will only match once (it includes <ol> for a start, and there's only one of these).

The solution I'd recommend: don't tie yourself in knots. Write a loop with a Matcher.find(), pulling out the matches one by one and adding them to a string buffer. It would go something like this:

    // Match each list item; group 1 holds the text between <li> and </li>
    Pattern p = Pattern.compile("<li>(.*?)</li>");
    Matcher m = p.matcher("...");
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
        sb.append("# ").append(m.group(1)).append("\n");
    }
    String result = sb.toString();
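
To keep the pound signs tied to the ol tag, the same loop idea can be applied on two levels. The following is only a sketch, assuming the lists are not nested and the tags carry no attributes; the class name OrderedListToText and the method convert are made up for illustration. An outer matcher finds each <ol>...</ol> block and an inner matcher pulls the <li> items out of it:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedListToText {
    // DOTALL lets '.' span the line breaks inside an <ol> block or an item.
    private static final Pattern OL = Pattern.compile("<ol>(.*?)</ol>", Pattern.DOTALL);
    private static final Pattern LI = Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);

    /** Returns one "# item" line for every <li> found inside an <ol> block. */
    static String convert(String html) {
        StringBuilder sb = new StringBuilder();
        Matcher ol = OL.matcher(html);
        while (ol.find()) {
            // Only search for items inside the text of this <ol> block.
            Matcher li = LI.matcher(ol.group(1));
            while (li.find()) {
                sb.append("# ").append(li.group(1).trim()).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String info = "<ul>\n<li>skipped</li>\n</ul>\n"
                + "<ol>\n<li>example1</li>\n<li>example2</li>\n<li>example3</li>\n</ol>";
        System.out.print(convert(info));
    }
}
```

Note that this returns only the converted items. Text outside the lists is dropped, so splicing the result back into the surrounding document would need extra work.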

score 1 · Answer 2 · answered Jul 13 '11 at 23:12

I would argue you can achieve a more robust solution using XPath and Java's document parser, as follows:

import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class Foo {

    public static void main(String[] args) throws Exception {
        final String info = "<html>\n<body>\n<ol>\n<li>example1</li>\n<li>example2</li>\n<li>example3</li>\n</ol>\n</body>\n</html>";
        final Document document = parseDocument(info);
        final XPathExpression xPathExpression = getXPathExpression("//ol/li");
        final NodeList nodes = (NodeList) xPathExpression.evaluate(document, XPathConstants.NODESET);

        // Prints # example1\n# example2\n# example3
        for (int i = 0; i < nodes.getLength(); i++) {
            final Node liNode = nodes.item(i);
            if (liNode.hasChildNodes()) {
                System.out.println("# " + liNode.getChildNodes().item(0).getTextContent());
            }
        }
    }

    private static Document parseDocument(final String info) throws Exception {
        final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        final DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(info.getBytes("UTF-8")));
    }

    private static XPathExpression getXPathExpression(final String expression) throws Exception {
        final XPathFactory factory = XPathFactory.newInstance();
        final XPath xpath = factory.newXPath();
        return xpath.compile(expression);
    }
}

score 0 · Answer 3 · edited Oct 16 '22 at 22:47

EDIT: fixing the <ul> problem mentioned by hoipolloi with this look ahead:

(?=((?!</ul>)(.|\n))*</ol>)

This one worked on your example:

info = info.replaceAll(
    "(?:<ol>\\s*)?<li>(.*?)</li>(?=((?!</ul>)(.|\n))*</ol>)(?:\\s*</ol>)?",
    "# $1"
);

(?:<ol>\s*)?

If it exists, match <ol> plus any whitespace following it. The (?: means don't capture this group.

<li>(.*?)</li>

Match <li>anything</li> and capture the "anything" in the first group. The *? means match any length non-greedily, i.e. stop at the first </li> after the <li>.

(?=((?!</ul>)(.|\n))*</ol>)

The new clause: ensure that an </ol> follows this <li> before any </ul>.

(?:\s*</ol>)?

And match any trailing whitespace plus </ol>.
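
For anyone who wants to try this out, here is a minimal sketch that applies the pattern to input containing both a <ul> and an <ol>. The class name LookaheadListDemo, the convert helper and the sample input are assumptions made for the demo. Note the doubled backslashes that \s needs inside a Java string literal:

```java
class LookaheadListDemo {
    /** Rewrites <li> items that are followed by </ol> before any </ul>. */
    static String convert(String html) {
        return html.replaceAll(
            "(?:<ol>\\s*)?<li>(.*?)</li>(?=((?!</ul>)(.|\n))*</ol>)(?:\\s*</ol>)?",
            "# $1");
    }

    public static void main(String[] args) {
        String info = "<ul>\n<li>bullet</li>\n</ul>\n"
                + "<ol>\n<li>example1</li>\n<li>example2</li>\n</ol>";
        // The <ul> block is left untouched; the <ol> items become "# ..." lines.
        System.out.println(convert(info));
    }
}
```

The lookahead rescans the rest of the input for every <li>, so this is best kept to small inputs like the one in the question.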

answered Jul 13 '11 at 22:55 by Jacob Eggers · edited Oct 16 '22 at 22:47 by miken32

I gotta say, that's pretty clever. +1 – Ray Toal Jul 13 '11 at 23:00
Congratulations for figuring this out. On the other hand, unless the code is an entry for the Sadomasochist Society Competition For Squashing The Most Code Into A Single Line, I'd really write it out (e.g. as I suggest above) for the sake of easier maintenance and comprehensibility. – Neil Coffey Jul 13 '11 at 23:01
@Neil I don't disagree with you. regex is not for the faint of heart, and java isn't the best tool for developing regex. – Jacob Eggers Jul 13 '11 at 23:06
@Jacob - This does not meet the requirement "The pound sign has to be associated with the ol html tag" as it will also replace children of the <ul> – hoipolloi Jul 13 '11 at 23:18
@Jacob - In fact, I'm not sure what value the optional (?:...)? captures are offering at all. – hoipolloi Jul 13 '11 at 23:37
@hoipolloi, 1. You're right, I didn't think of `<ul>`. 2. The optional groups (`(?:...)?`) were put there to remove the preceding and trailing `<ol>` tags of the list. – Jacob Eggers Jul 13 '11 at 23:43
@hoipolloi - I've fixed the `<ul>` problem. – Jacob Eggers Jul 14 '11 at 00:21
@Jacob - "optional groups ... were put there to remove the ... tags" Ah, I see how that works now. Thanks for clarifying :) – hoipolloi Jul 14 '11 at 01:34

score 0 · Answer 4 · answered Jul 13 '11 at 22:55

The answer to "what you are doing wrong" is that you are replacing the entire match of your single regex (which spans from <ol> all the way to </ol>) with the value of your second group. That group sits inside a repeated fragment, so $2 holds only the last iteration's match.
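
A small sketch makes this visible. It uses a simplified form of the regex from the question (the leading (?s). is dropped, and the class and method names are invented for the demo) and reads group 2 after a single successful match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    /** Returns what group 2 holds once the repeated group has matched every item. */
    static String lastCapture(String html) {
        Matcher m = Pattern.compile("<ol>\n(<li>(.*?)</li>\n)*</ol>").matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String info = "<ol>\n<li>example1</li>\n<li>example2</li>\n<li>example3</li>\n</ol>";
        System.out.println(lastCapture(info)); // prints example3
    }
}
```

Each iteration of the repeated group overwrites the previous capture, so only example3 is left by the time the match completes.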

score 0 · Answer 5 · answered Jul 13 '11 at 23:41

I would use a simpler solution instead of a complex regex. For example:

    Scanner scanner = new Scanner(str); // the argument can also be a File or an InputStream
    scanner.useDelimiter("</?ol>");
    StringBuilder result = new StringBuilder();
    while (scanner.hasNext())
    {
        // convert the <li> items in each chunk and keep the converted text
        result.append(scanner.next().replaceAll("<li>(.*?)</li>\n*", "# $1\n"));
    }
    str = result.toString();

score 0 · Answer 6 · answered Jul 14 '11 at 09:16

Don't use regular expressions for parsing XML/HTML. Full stop. You'll never handle all the possible variations that can legally occur in the input, and you'll forever be telling people who supply the content that you're sorry, you can only handle a restricted subset of XML/HTML, and they will forever be cursing you. And if you do get to the point where you can handle 99% of legal input, your code will be unmaintainable and slow.

There are off-the-shelf parsers to do this job - use them.

score 0 · Answer 7 · answered Jul 14 '11 at 12:26

info = info.replaceAll("(?:<ol>|\\G)\\s*<li>(.+?)</li>(?:\\s*</ol>)?",
                       "# $1\n");

(?:<ol>|\G) ensures that each bunch of matches starts either with <ol> or where the last match left off, so it can never start matching inside a <ul> element.
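
Here is a minimal sketch of that behaviour, with a <ul> placed in front of the <ol>. The class name ContiguousMatchDemo, the convert helper and the sample input are assumptions made for the demo:

```java
class ContiguousMatchDemo {
    /** Rewrites only the <li> items that follow an <ol> or the previous match. */
    static String convert(String html) {
        return html.replaceAll(
            "(?:<ol>|\\G)\\s*<li>(.+?)</li>(?:\\s*</ol>)?",
            "# $1\n");
    }

    public static void main(String[] args) {
        String info = "<ul>\n<li>bullet</li>\n</ul>\n"
                + "<ol>\n<li>example1</li>\n<li>example2</li>\n</ol>";
        // The <ul> block is copied through unchanged; each <ol> item becomes "# ...".
        System.out.print(convert(info));
    }
}
```

One caveat, as far as I can tell: on the first match attempt \G also matches at the very start of the input, so a string that begins directly with <li> would be rewritten even without an enclosing <ol>.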
