3

I have data in the form:

<ol>
<li>example1</li>
<li>example2</li>
<li>example3</li>
</ol>

which needs to turn into

# example1
# example2
# example3

The pound sign has to be associated with the ol html tag. I'm using Java regular expressions and this is what I have so far:

info = info.replaceAll("(?s).<ol>\n(<li>(.*?)</li>\n)*</ol>","# $2");

info is a String object containing the data. Also, there may be line breaks between the li tags. When I run it, it only prints the last item, i.e. the result is

 # example3

example2 and example1 are missing

Any thoughts on what I'm doing wrong?

nhahtdh
user843614
  • [Don't use RegEx to parse XML/HTML tags...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Nightfirecat Jul 13 '11 at 22:49
  • There are a number of examples of related questions here on SO. – Clockwork-Muse Jul 13 '11 at 22:53
  • @Nightfirecat - despite the usual protestations on SO about html and regex's in this instance it would seem perfectly reasonable. – Joel Jul 13 '11 at 23:06
  • Well, assuming it's all properly formatted, it's entirely possible (and even somewhat reasonable), but it's still not what RegEx is meant to do. – Nightfirecat Jul 13 '11 at 23:08
  • Agreed, but if you adopt Neil's approach it's excusable. – Joel Jul 13 '11 at 23:13
  • Obviously, you need to make a practical decision between the required level of robustness vs readability vs simplicity of code etc. I like my version: it's a few lines of readable code. It won't work in a few corner cases. The XPath infrastructure is cumbersome for this simple requirement, and also liable to fall flat on its arse if the XML isn't well-formed. – Neil Coffey Jul 13 '11 at 23:41

7 Answers

1

Your regex has a couple of problems:

  • it contains a capturing group inside a capturing group
  • overall, it will only match once (it includes <ol> for a start, and there's only one of these).

The solution I'd recommend: don't tie yourself in knots. Write a loop with a Matcher.find(), pulling out the matches one by one and adding them to a string buffer. It would go something like this:

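    // Needs java.util.regex.Pattern and java.util.regex.Matcher.
    // Match one <li> item at a time; DOTALL lets an item's text span line breaks.
    Pattern p = Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);
    Matcher m = p.matcher("...");
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        // One "# item" line per match.
        sb.append("# ").append(m.group(1)).append("\n");
    }
    String result = sb.toString();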
Neil Coffey
1

I would argue you can achieve a more robust solution using XPath and Java's document parser, as follows:

import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class Foo {

    public static void main(String[] args) throws Exception {
        final String info = "<html>\n<body>\n<ol>\n<li>example1</li>\n<li>example2</li>\n<li>example3</li>\n</ol>\n</body>\n</html>";
        final Document document = parseDocument(info);
        final XPathExpression xPathExpression = getXPathExpression("//ol/li");
        final NodeList nodes = (NodeList) xPathExpression.evaluate(document, XPathConstants.NODESET);

        // Prints # example1\n# example2\n# example3
        for (int i = 0; i < nodes.getLength(); i++) {
            final Node liNode = nodes.item(i);
            if (liNode.hasChildNodes()) {
                System.out.println("# " + liNode.getChildNodes().item(0).getTextContent());
            }
        }
    }

    private static Document parseDocument(final String info) throws Exception {
        final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        final DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(info.getBytes("UTF-8")));
    }

    private static XPathExpression getXPathExpression(final String expression) throws Exception {
        final XPathFactory factory = XPathFactory.newInstance();
        final XPath xpath = factory.newXPath();
        return xpath.compile(expression);
    }
}
hoipolloi
0

EDIT: fixing the <ul> problem mentioned by hoipolloi with this lookahead:

(?=((?!</ul>)(.|\n))*</ol>)

This one worked on your example:

info = info.replaceAll(
    "(?:<ol>\\s*)?<li>(.*?)</li>(?=((?!</ul>)(.|\n))*</ol>)(?:\\s*</ol>)?",
    "# $1"
);

  1. (?:<ol>\s*)?
    • If it exists, match <ol> plus any whitespace following it. The (?: means don't capture this group.
  2. <li>(.*?)</li>
    • Match an <li>anything</li> and capture the anything in the first group. The *? means match any length, non-greedily (i.e. stop at the first </li> after the <li>).
  3. New clause: (?=((?!</ul>)(.|\n))*</ol>)
    • Ensure that an </ol> follows this <li> before a </ul>.
  4. (?:\s*</ol>)?
    • And match any trailing whitespace plus </ol>.
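
To see the pieces working together, here is a small sketch that runs the expression on a string containing both a `<ul>` and an `<ol>` list. The class name, the `convert` helper, and the sample input are made up for illustration, and the backslashes are doubled as a Java string literal requires:

```java
public class LookaheadDemo {

    // Hypothetical helper wrapping the replaceAll call from this answer.
    static String convert(String info) {
        return info.replaceAll(
            "(?:<ol>\\s*)?<li>(.*?)</li>(?=((?!</ul>)(.|\n))*</ol>)(?:\\s*</ol>)?",
            "# $1");
    }

    public static void main(String[] args) {
        // The <ul> item should be left alone; only the <ol> items are rewritten.
        String info = "<ul>\n<li>a</li>\n</ul>\n<ol>\n<li>b</li>\n<li>c</li>\n</ol>";
        System.out.println(convert(info));
    }
}
```

The `(.|\n)` alternation inside the lookahead is re-scanned for every `<li>`, so this is probably only comfortable on short inputs.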
miken32
Jacob Eggers
  • I gotta say, that's pretty clever. +1 – Ray Toal Jul 13 '11 at 23:00
  • Congratulations for figuring this out. On the other hand, unless the code is an entry for the Sadomasochist Society Competition For Squashing The Most Code Into A Single Line, I'd really write it out (e.g. as I suggest above) for the sake of easier maintenance and comprehensibility. – Neil Coffey Jul 13 '11 at 23:01
  • @Neil I don't disagree with you. regex is not for the faint of heart, and java isn't the best tool for developing regex. – Jacob Eggers Jul 13 '11 at 23:06
  • @Jacob - This does not meet the requirement "The pound sign has to be associated with the ol html tag" as it will also replace children of the <ul> tag. – hoipolloi Jul 13 '11 at 23:18
  • @Jacob - In fact, I'm not sure what value the optional (?:<ol>\s*)? and (?:\s*</ol>)? captures are offering at all. – hoipolloi Jul 13 '11 at 23:37
  • @hoipolloi, 1. You're right, I didn't think of `<ul>` tags. 2. The optional groups (`(?:<ol>\s*)?` and `(?:\s*</ol>)?`) were put there to remove the preceding and trailing `<ol>`/`</ol>` tags of the list. – Jacob Eggers Jul 13 '11 at 23:43
  • @hoipolloi - I've fixed the `<ul>` problem you mentioned. Though, I'm certainly not suggesting this is a good solution to the OP's problem. – Jacob Eggers Jul 14 '11 at 00:21
  • @Jacob - "optional groups ... were put there to remove the ... <ol>/</ol> tags" Ah, I see how that works now. Thanks for clarifying :) – hoipolloi Jul 14 '11 at 01:34
0

The answer to "what you are doing wrong" is that you are replacing the entire single regex (which matches from ol all the way to /ol) with the value of your second group. The second group was in a repeated fragment, so the result of $2 was the last match of that group.
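
A minimal sketch of that behaviour, assuming the question's pattern without its leading `.` (which would otherwise demand one extra character before `<ol>`). The class and method names are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {

    // Returns what group 2 holds once the pattern has matched the whole list.
    static String lastCapture(String info) {
        Matcher m = Pattern.compile("(?s)<ol>\n(<li>(.*?)</li>\n)*</ol>").matcher(info);
        // A group inside a repeated fragment keeps only its final iteration.
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String info = "<ol>\n<li>example1</li>\n<li>example2</li>\n<li>example3</li>\n</ol>";
        System.out.println(lastCapture(info)); // prints example3
    }
}
```

Because the single match runs from `<ol>` to `</ol>`, replaceAll swaps that entire span for "# " plus this one capture.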

Ray Toal
0

I would use a simpler solution instead of a complex regex. For example:

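    Scanner scann = new Scanner(str); // the parameter can also be a file or an InputStream
    scann.useDelimiter("</?ol>");
    StringBuilder result = new StringBuilder();
    while (scann.hasNext())
    {
        // Convert the items in each chunk and keep the converted text,
        // rather than overwriting it on the next iteration.
        String chunk = scann.next();
        result.append(chunk.replaceAll("<li>(.*?)</li>\n*", "# $1\n"));
    }
    str = result.toString();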
Govan
0

Don't use regular expressions for parsing XML/HTML. Full stop. You'll never handle all the possible variations that can legally occur in the input, and you'll forever be telling people who supply the content that you're sorry, you can only handle a restricted subset of XML/HTML, and they will forever be cursing you. And if you do get to the point where you can handle 99% of legal input, your code will be unmaintainable and slow.

There are off-the-shelf parsers to do this job - use them.

Michael Kay
0
info = info.replaceAll("(?:<ol>|\\G)\\s*<li>(.+?)</li>(?:\\s*</ol>)?",
                       "# $1\n");

(?:<ol>|\G) ensures that each bunch of matches starts either with <ol> or where the last match left off, so it can never start matching inside a <ul> element.
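
A quick way to check that claim is to run the same call on input that also contains a `<ul>` list. This is only a sketch; the class name, the `convert` helper, and the sample input are made up for illustration:

```java
public class ContinueMatchDemo {

    // Hypothetical helper wrapping the replaceAll call shown above.
    static String convert(String info) {
        return info.replaceAll("(?:<ol>|\\G)\\s*<li>(.+?)</li>(?:\\s*</ol>)?", "# $1\n");
    }

    public static void main(String[] args) {
        // \G only matches where the previous match ended (or at the very start),
        // so the <li> inside <ul> is never a valid starting point.
        String info = "<ul>\n<li>a</li>\n</ul>\n<ol>\n<li>b</li>\n<li>c</li>\n</ol>";
        System.out.print(convert(info));
    }
}
```

Here the `<ul>` block passes through unchanged, while the `<ol>` items become `# b` and `# c`, each followed by a newline.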

Alan Moore