Pattern optimization

Question

I need to scrape some content from an HTTP response with Java. The required fields in the response are: foo, bar and bla. My current pattern is very slow. Any ideas how to improve it?

Response:

...
<div class="ui-a">
<div class="ui-b">
    <p><strong>foo</strong></p>
    <p>bar</p>
</div>
<div class="ui-c">
    <p><strong>bla</strong></p>
    <p>...</p>
</div>
</div>

<div class="ui-a">
<div class="ui-b">
    <p><strong>foo1</strong></p>
    <p>bar1</p>
</div>
<div class="ui-c">
    <p><strong>bla1</strong></p>
    <p>...</p>
</div>

Pattern:

.*?<div class="ui-a">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>.*?

Also, if you are not in control of the HTML you are parsing, there can be line breaks and whitespace in the tags, which will make your pattern fail to match. See http://jsfiddle.net/qhAPa/ — Stefan, Nov 10 '11 at 21:54
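
To illustrate the comment above, here is a minimal sketch of how a literal tag stops matching once whitespace creeps in. The class name, the `found` helper and the tolerant variant of the tag are made up for this example; they are not part of the question.

```java
import java.util.regex.Pattern;

public class TagWhitespaceDemo {
    // The opening tag exactly as it is written in the question's pattern.
    static final Pattern STRICT = Pattern.compile("<div class=\"ui-a\">");
    // Hypothetical tolerant variant: allows whitespace runs (including line breaks)
    // between the parts of the tag.
    static final Pattern TOLERANT = Pattern.compile("<div\\s+class\\s*=\\s*\"ui-a\"\\s*>");

    static boolean found(Pattern p, String html) {
        return p.matcher(html).find();
    }

    public static void main(String[] args) {
        String compact = "<div class=\"ui-a\">";
        String wrapped = "<div\n     class=\"ui-a\" >";
        System.out.println(found(STRICT, compact));   // prints "true"
        System.out.println(found(STRICT, wrapped));   // prints "false"
        System.out.println(found(TOLERANT, compact)); // prints "true"
        System.out.println(found(TOLERANT, wrapped)); // prints "true"
    }
}
```

The tolerant variant still assumes double quotes and a single class attribute, so it only covers the whitespace part of the problem.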

score 2 · Accepted Answer · answered Nov 10 '11 at 21:59

Since you can't make use of an HTML parser, try something like this:

import java.util.regex.*;

public class Main {
    public static void main (String[] args) {
        String html =
                "...\n" +
                "<div class=\"ui-a\">\n" +
                "<div class=\"ui-b\">\n" +
                "    <p><strong>foo</strong></p>\n" +
                "    <p>bar</p>\n" +
                "</div>\n" +
                "<div class=\"ui-c\">\n" +
                "    <p><strong>bla</strong></p>\n" +
                "    <p>...</p>\n" +
                "</div>\n" +
                "</div>\n" +
                "\n" +
                "<div class=\"ui-a\">\n" +
                "<div class=\"ui-b\">\n" +
                "    <p><strong>foo1</strong></p>\n" +
                "    <p>bar1</p>\n" +
                "</div>\n" +
                "<div class=\"ui-c\">\n" +
                "    <p><strong>bla1</strong></p>\n" +
                "    <p>...</p>\n" +
                "</div>";

        Pattern p = Pattern.compile(
                "(?sx)                               # enable DOT-ALL and COMMENTS     \n" +
                "<div\\s+class=\"ui-a\">             # match '<div...ui-a...>'         \n" +
                "(?:(?!<strong>).)*+                 # match everything up to <strong> \n" +
                "<strong>([^<>]++)</strong>          # match <strong>...</strong>      \n" +
                "(?:(?!<p>).)*+                      # match up to <p>                 \n" +
                "<p>([^<>]++)</p>                    # match <p>...</p>                \n" +
                "(?:(?!<div\\s+class=\"ui-c\">).)*+  # match up to '<div...ui-c...>'   \n" +
                "<div\\s+class=\"ui-c\">             # match '<div...ui-c...>'         \n" +
                "(?:(?!<strong>).)*+                 # match everything up to <strong> \n" +
                "<strong>([^<>]++)</strong>          # match <strong>...</strong>      \n"
        );

        Matcher m = p.matcher(html);

        while(m.find()) {
            System.out.println("---------------");
            for(int i = 1; i <= m.groupCount(); i++) {
                System.out.printf("group(%d) = %s\n", i, m.group(i));
            }
        }
    }
}

which will print the following to the console:

---------------
group(1) = foo
group(2) = bar
group(3) = bla
---------------
group(1) = foo1
group(2) = bar1
group(3) = bla1

Note my changes:

*+ and ++: http://www.regular-expressions.info/possessive.html
Instead of .*?, I used (?:(?!...).)*+. The first, .*?, keeps track of every position it has matched so that it can back-track at a later stage. The latter, (?:(?!...).)*+, is possessive: it does not keep track of these positions and never back-tracks.

That should make it quicker (not sure by how much...).
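
A minimal sketch of what the (?:(?!...).)*+ construct consumes on its own, in case the notation is unfamiliar. The class name, the `prefix` helper and the extra capture group around the construct are assumptions added for illustration; the `^` anchor is only there to keep the demo deterministic.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedDotDemo {
    // (?s) lets '.' match line breaks. The capture group records whatever the
    // possessive "any char that does not start <strong>" loop consumed.
    static final Pattern UP_TO_STRONG =
            Pattern.compile("(?s)^((?:(?!<strong>).)*+)<strong>");

    // Returns the text in front of the first <strong>, or null if there is none.
    static String prefix(String input) {
        Matcher m = UP_TO_STRONG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The loop stops right before <strong>, so the prefix is "<p>\n  ".
        System.out.println("[" + prefix("<p>\n  <strong>foo</strong>") + "]");
        // Without any <strong> the loop runs to the end of the input and,
        // being possessive, gives nothing back: no match.
        System.out.println(prefix("<p>no strong tag here</p>")); // prints "null"
    }
}
```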

perfect, your pattern is really great. The execution time before was 200ms and now 35ms. — CannyDuck, Nov 10 '11 at 22:18
In the last two pattern lines you wrote: "(?:(?!<strong>).)*+" and "<strong>([^<>]++)</strong>". I have a defined suffix in the last strong tag. Any idea why this change won't work: "(?:(?!<strong>).)*+" and "<strong>([^<>]++)suffix</strong>" — CannyDuck, Nov 10 '11 at 22:42
@CannyDuck, Because `[^<>]++` is possessive and once it matches suffix, it will never back-track, causing the pattern to fail. If you have a large suffix, match un-greedy: `([^<>]+?)suffix` and if it's a small suffix, match greedy: `([^<>]+)suffix` (but not possessive! So drop the extra `+`). Take your time to read through regular-expressions.info/possessive.html which explains the mechanism of possessive matching much better than I can (especially in these small comment-boxes!) :) — Bart Kiers, Nov 10 '11 at 23:22
@CannyDuck, oh, and all the `\\s+` occurrences can be replaced by `\\s++`. — Bart Kiers, Nov 11 '11 at 09:26
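
The difference between the possessive, greedy and un-greedy forms described in the comments above can be tried out with a short sketch like this one. The input string and the "-suffix" text are made up for the example, as are the class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PossessiveSuffixDemo {
    // Returns group(1) of the first match, or null when the pattern does not match.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<strong>bla-suffix</strong>"; // hypothetical input with a fixed suffix

        // Possessive: [^<>]++ swallows "bla-suffix" and never gives it back.
        System.out.println(capture("<strong>([^<>]++)-suffix</strong>", html)); // prints "null"
        // Greedy: swallows "bla-suffix", then back-tracks until "-suffix" fits.
        System.out.println(capture("<strong>([^<>]+)-suffix</strong>", html));  // prints "bla"
        // Un-greedy: grows one character at a time until "-suffix" fits.
        System.out.println(capture("<strong>([^<>]+?)-suffix</strong>", html)); // prints "bla"
    }
}
```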

score 1 · Answer 2 · answered Nov 10 '11 at 21:07

It seems that what you are looking for is only ever between <strong> tags, so you can work with:

<strong>([a-zA-Z0-9]+)</strong>

Further, depending on what appears inside the strong tag, you can tighten the pattern: e.g. if you are sure that the text is always lowercase, you can remove A-Z from the character class, or if it always contains exactly 4 characters, you can use {4} in place of the +.

Ravi Bhatt

The matching is slower when using your advice. I tried it: my pattern (200ms), modified with your advice (300ms) – CannyDuck Nov 10 '11 at 21:36

score 0 · Answer 3 · answered Nov 10 '11 at 21:10

All of your strings are inside <p> tags, so you can search for what each <p> contains (and strip the <strong> tags). But it may be better to use a parser and not a regex: find all <p> elements; if a <p> has a child node, take that node's text; otherwise take the text of the <p> itself.

Zernike

score 0 · Answer 4 · edited May 23 '17 at 11:55

Consider using JSoup instead. There are some well-known problems with using regular expressions to parse HTML.

answered Nov 10 '11 at 21:11

Ed Staub

Good idea, but not possible in my case, because the lib is too large for an Android project. – CannyDuck Nov 10 '11 at 21:27

score 0 · Answer 5 · answered Nov 10 '11 at 21:37

This assumes you aren't relying on the regex to validate the HTML and you don't have permission to modify the structure of the HTML. Getting rid of the last .*? is necessary because it conflicts with the first one on subsequent matches. Essentially you end up with .*?.*? back to back, so the engine will try every possible way of dividing the characters between the last <strong> tag and the next <div class="ui-a"> tag. Very inefficient. Try this:

.*?<div class="ui-a">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>

On a side note: are you sure you want to find the first <strong> tag inside <div class="ui-a">? The first <strong> tag appears to sit inside <div class="ui-b">, in which case this:

.*?<div class="ui-b">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>

is more accurate.

If you know there are no nested tags in the capture groups you want, you can further optimize it with:

.*?<div class="ui-b">.*?<strong>([^<]*)</strong>.*?<p>([^<]*)</p>.*?</div>.*?<div class="ui-c">.*?<strong>([^<]*)</strong>

Thanks for your advice. I tried all of your versions. #3 is the fastest one, but I thought there was more room for optimization; your implementation is less than 10% faster. — CannyDuck, Nov 10 '11 at 21:49
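
A small check that dropping the trailing (and leading) .*? does not change what gets captured. This is only a sketch: it assumes the pattern is applied with Matcher.find() and Pattern.DOTALL, neither of which is stated in the question, and the class name, the CORE constant and the `extract` helper are made up here. It says nothing about timing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingLazyDemo {
    // The sample response from the question.
    static final String HTML =
            "...\n"
            + "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
            + "    <p><strong>foo</strong></p>\n    <p>bar</p>\n</div>\n"
            + "<div class=\"ui-c\">\n    <p><strong>bla</strong></p>\n    <p>...</p>\n</div>\n</div>\n\n"
            + "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
            + "    <p><strong>foo1</strong></p>\n    <p>bar1</p>\n</div>\n"
            + "<div class=\"ui-c\">\n    <p><strong>bla1</strong></p>\n    <p>...</p>\n</div>";

    // The question's pattern without its leading and trailing .*?
    static final String CORE =
            "<div class=\"ui-a\">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>"
            + ".*?<div class=\"ui-c\">.*?<strong>(.*?)</strong>";

    // Collects "group1|group2|group3" for every match found in HTML.
    static List<String> extract(String regex) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(HTML);
        List<String> rows = new ArrayList<>();
        while (m.find()) {
            rows.add(m.group(1) + "|" + m.group(2) + "|" + m.group(3));
        }
        return rows;
    }

    public static void main(String[] args) {
        // All three lines print [foo|bar|bla, foo1|bar1|bla1]
        System.out.println(extract(".*?" + CORE + ".*?")); // as in the question
        System.out.println(extract(".*?" + CORE));         // trailing .*? removed
        System.out.println(extract(CORE));                 // leading .*? removed as well
    }
}
```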

Mike Ryan · Answer 6 · 2011-11-10T22:07:26.747

Your regex has both a leading and a trailing .*? and I don't understand why. And if the data is well-formatted, you really just mean a certain amount of whitespace, yes? Why wouldn't it be:

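// requires: import java.util.regex.*;
// \s must be written as \\s inside a Java string literal, and the string has to be compiled.
Pattern p = Pattern.compile("<div class=\"ui-b\">\\s*<p><strong>([^<]*)</strong></p>\\s*<p>([^<]*)</p>\\s*</div>\\s*<div class=\"ui-c\">\\s*<p><strong>([^<]*)</strong></p>");
Matcher m = p.matcher(responseText);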

while (m.find()) {
   String foo = m.group(1);
   String bar = m.group(2);
   String bla = m.group(3);

   /* do whatever w/ foo, bar, bla */
}

Here I've dropped all your .*? and replaced the inner ones with whitespace matches (or is there more in there that you're leaving out of the example, perhaps?). But regardless, why would you need the .*? at the beginning and end?

If the data is well-formatted, having the pattern match only whitespace between the tags should speed it up substantially.
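
For reference, a self-contained sketch of the whitespace-only idea run against one block of the sample response. It assumes the real response contains nothing but whitespace between the tags, exactly like the sample; the class name and the `extract` helper are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceOnlyDemo {
    // Only whitespace is allowed between the tags; no .*? anywhere.
    static final Pattern ROW = Pattern.compile(
            "<div class=\"ui-b\">\\s*<p><strong>([^<]*)</strong></p>\\s*<p>([^<]*)</p>\\s*</div>"
            + "\\s*<div class=\"ui-c\">\\s*<p><strong>([^<]*)</strong></p>");

    // Joins the three captures of every match as "foo,bar,bla;".
    static String extract(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            out.append(m.group(1)).append(',')
               .append(m.group(2)).append(',')
               .append(m.group(3)).append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"ui-a\">\n<div class=\"ui-b\">\n    <p><strong>foo</strong></p>\n"
                + "    <p>bar</p>\n</div>\n<div class=\"ui-c\">\n    <p><strong>bla</strong></p>\n"
                + "    <p>...</p>\n</div>\n</div>";
        System.out.println(extract(html)); // prints "foo,bar,bla;"
    }
}
```

The trade-off is strictness: any extra markup between the tags means no match at all, so this only suits responses whose layout is known to be stable.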
