3

I need to scrape some content from an HTTP response with Java. The required fields in the response are foo, bar and bla. My current pattern is very slow. Any ideas how to improve it?

Response:

...
<div class="ui-a">
<div class="ui-b">
    <p><strong>foo</strong></p>
    <p>bar</p>
</div>
<div class="ui-c">
    <p><strong>bla</strong></p>
    <p>...</p>
</div>
</div>

<div class="ui-a">
<div class="ui-b">
    <p><strong>foo1</strong></p>
    <p>bar1</p>
</div>
<div class="ui-c">
    <p><strong>bla1</strong></p>
    <p>...</p>
</div>

Pattern:

.*?<div class="ui-a">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>.*?
Edwin Buck
CannyDuck

6 Answers

2

Since you can't make use of an HTML parser, try something like this:

import java.util.regex.*;

public class Main {
    public static void main (String[] args) {
        String html =
                "...\n" +
                "<div class=\"ui-a\">\n" +
                "<div class=\"ui-b\">\n" +
                "    <p><strong>foo</strong></p>\n" +
                "    <p>bar</p>\n" +
                "</div>\n" +
                "<div class=\"ui-c\">\n" +
                "    <p><strong>bla</strong></p>\n" +
                "    <p>...</p>\n" +
                "</div>\n" +
                "</div>\n" +
                "\n" +
                "<div class=\"ui-a\">\n" +
                "<div class=\"ui-b\">\n" +
                "    <p><strong>foo1</strong></p>\n" +
                "    <p>bar1</p>\n" +
                "</div>\n" +
                "<div class=\"ui-c\">\n" +
                "    <p><strong>bla1</strong></p>\n" +
                "    <p>...</p>\n" +
                "</div>";

        Pattern p = Pattern.compile(
                "(?sx)                               # enable DOT-ALL and COMMENTS     \n" +
                "<div\\s+class=\"ui-a\">             # match '<div...ui-a...>'         \n" +
                "(?:(?!<strong>).)*+                 # match everything up to <strong> \n" +
                "<strong>([^<>]++)</strong>          # match <strong>...</strong>      \n" +
                "(?:(?!<p>).)*+                      # match up to <p>                 \n" +
                "<p>([^<>]++)</p>                    # match <p>...</p>                \n" +
                "(?:(?!<div\\s+class=\"ui-c\">).)*+  # match up to '<div...ui-c...>'   \n" +
                "<div\\s+class=\"ui-c\">             # match '<div...ui-c...>'         \n" +
                "(?:(?!<strong>).)*+                 # match everything up to <strong> \n" +
                "<strong>([^<>]++)</strong>          # match <strong>...</strong>      \n"
        );

        Matcher m = p.matcher(html);

        while(m.find()) {
            System.out.println("---------------");
            for(int i = 1; i <= m.groupCount(); i++) {
                System.out.printf("group(%d) = %s\n", i, m.group(i));
            }
        }
    }
}

which will print the following to the console:

---------------
group(1) = foo
group(2) = bar
group(3) = bla
---------------
group(1) = foo1
group(2) = bar1
group(3) = bla1

Note my changes:

  • *+ and ++ are possessive quantifiers: http://www.regular-expressions.info/possessive.html
  • Instead of .*?, I used (?:(?!...).)*+. The first, .*?, keeps track of every position it has matched so that it can back-track at a later stage. The latter, (?:(?!...).)*+, is possessive and does not keep track of those positions.

That should make it quicker (not sure by how much...).
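
To see what the (?:(?!...).)*+ construct from the notes above actually consumes, here is a minimal sketch. The class and method names (TemperedTokenDemo, remainderAfterSkip, firstStrong) are made up for illustration and are not part of the code above; the two expressions are cut down from the larger pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class TemperedTokenDemo {
    // Possessively consumes characters as long as "<strong>" does not start here.
    static final Pattern SKIP = Pattern.compile("(?s)(?:(?!<strong>).)*+");

    // The same skip, followed by the capture used in the larger pattern.
    static final Pattern CAPTURE =
            Pattern.compile("(?s)(?:(?!<strong>).)*+<strong>([^<>]++)</strong>");

    // Returns the part of the input that the skip expression leaves unconsumed.
    static String remainderAfterSkip(String input) {
        Matcher m = SKIP.matcher(input);
        m.lookingAt(); // always succeeds, the expression may match zero characters
        return input.substring(m.end());
    }

    // Returns the text of the first <strong> element, or null if there is none.
    static String firstStrong(String input) {
        Matcher m = CAPTURE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"ui-b\">\n    <p><strong>foo</strong></p>";
        System.out.println(remainderAfterSkip(html)); // <strong>foo</strong></p>
        System.out.println(firstStrong(html));        // foo
    }
}
```

The skip stops exactly in front of <strong>, so the literal that follows it can match without the engine having to give characters back.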

Bart Kiers
  • perfect, your pattern is really great. The execution time before was 200ms and now 35ms. – CannyDuck Nov 10 '11 at 22:18
  • In the last two pattern lines you wrote: "(?:(?!<strong>).)*+" and "<strong>([^<>]++)</strong>". I have a defined suffix inside the last strong tag. Any idea why this change won't work: "<strong>([^<>]++)suffix</strong>"? – CannyDuck Nov 10 '11 at 22:42
  • @CannyDuck, Because `[^<>]++` is possessive and once it matches suffix, it will never back-track, causing the pattern to fail. If you have a large suffix, match un-greedy: `([^<>]+?)suffix` and if it's a small suffix, match greedy: `([^<>]+)suffix` (but not possessive! So drop the extra `+`). Take your time to read through regular-expressions.info/possessive.html which explains the mechanism of possessive matching much better than I can (especially in these small comment-boxes!) :) – Bart Kiers Nov 10 '11 at 23:22
  • @CannyDuck, oh, and all the `\\s+` occurrences can be replaced by `\\s++`. – Bart Kiers Nov 11 '11 at 09:26
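
The possessive-versus-greedy behaviour described in the comments above can be checked with a small sketch. The class name PossessiveSuffixDemo, the helper capture and the input string are assumptions made up for this illustration; "suffix" stands in for whatever fixed text follows the value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compares possessive, greedy and reluctant captures.
class PossessiveSuffixDemo {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<strong>blasuffix</strong>";

        // Possessive: [^<>]++ swallows "blasuffix" and never gives it back,
        // so the literal "suffix" can no longer match.
        System.out.println(capture("<strong>([^<>]++)suffix</strong>", input)); // null

        // Greedy: swallows "blasuffix", then back-tracks until "suffix" fits.
        System.out.println(capture("<strong>([^<>]+)suffix</strong>", input));  // bla

        // Reluctant: grows one character at a time until "suffix" fits.
        System.out.println(capture("<strong>([^<>]+?)suffix</strong>", input)); // bla
    }
}
```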
1

It seems that what you are looking for is only between the <strong> tags, so you can work with:

<strong>([a-zA-Z0-9]+)</strong>

Further, depending on what comes inside the strong tag, you can tighten the pattern. For example, if you are sure that the text is always lowercase you can remove A-Z from the pattern above, or if it always contains exactly 4 characters you can use {4} after the character class instead of +.
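
As a rough sketch of how this pattern behaves on the sample response from the question (the class name StrongOnlyDemo and the helper findAll are made up for this illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: collects every group(1) a pattern finds.
class StrongOnlyDemo {
    // The sample response from the question.
    static final String SAMPLE =
            "...\n"
          + "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
          + "    <p><strong>foo</strong></p>\n    <p>bar</p>\n</div>\n"
          + "<div class=\"ui-c\">\n    <p><strong>bla</strong></p>\n    <p>...</p>\n</div>\n</div>\n\n"
          + "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
          + "    <p><strong>foo1</strong></p>\n    <p>bar1</p>\n</div>\n"
          + "<div class=\"ui-c\">\n    <p><strong>bla1</strong></p>\n    <p>...</p>\n</div>";

    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Any run of letters and digits inside <strong>.
        System.out.println(findAll("<strong>([a-zA-Z0-9]+)</strong>", SAMPLE));
        // prints [foo, bla, foo1, bla1]

        // Exactly four characters inside <strong>.
        System.out.println(findAll("<strong>([a-zA-Z0-9]{4})</strong>", SAMPLE));
        // prints [foo1, bla1]
    }
}
```

Note that bar and bar1 are not returned, because in the sample they sit in a plain <p> rather than inside <strong>.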

Ravi Bhatt
  • The matching is slower when I follow your advice. I tried it: my pattern takes 200ms, and modified with your advice it takes 300ms. – CannyDuck Nov 10 '11 at 21:36
0

All your strings are inside <p> tags, so you can search for the contents of each <p> (and strip the <strong>). It may be better to use a parser instead of a regex, though: find all <p> elements, and for each one take its text, whether or not it has a child node.

Zernike
0

Consider using JSoup instead. There are some well-known problems with using regular expressions to parse HTML.

Community
Ed Staub
0

This assumes you aren't relying on the regex to validate the HTML and that you don't have permission to modify the structure of the HTML. Getting rid of the last .*? is necessary because it conflicts with the leading .*? on subsequent matches. Essentially you have .*?.*?, so the engine will attempt every possible way of splitting the characters between the last <strong> tag and the next <div class="ui-a"> tag, which is very inefficient. Try this:

.*?<div class="ui-a">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>

On a side note: are you sure you want to find the first <strong> tag inside of <div class="ui-a">? The first <strong> tag actually appears inside <div class="ui-b">, in which case this:

.*?<div class="ui-b">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>

is more accurate.

If you know there are no nested tags in the capture groups you want, you can further optimize it with:

.*?<div class="ui-b">.*?<strong>([^<]*)</strong>.*?<p>([^<]*)</p>.*?</div>.*?<div class="ui-c">.*?<strong>([^<]*)</strong>
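
As a sanity check that dropping the trailing .*? does not change what gets captured, here is a minimal sketch run against the sample response from the question. It assumes the pattern is compiled with DOTALL (the (?s) flag), since the question does not say which flags are used; the class name TrailingLazyDemo and its members are made up for this illustration. It compares results only and does not measure speed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: same captures with and without the trailing .*?
class TrailingLazyDemo {
    // The first pattern suggested above, with DOTALL assumed.
    static final String WITHOUT_TRAILING =
            "(?s).*?<div class=\"ui-a\">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>"
          + ".*?</div>.*?<div class=\"ui-c\">.*?<strong>(.*?)</strong>";

    // The question's pattern: identical apart from the final .*?
    static final String WITH_TRAILING = WITHOUT_TRAILING + ".*?";

    // The sample response from the question.
    static final String SAMPLE =
            "...\n"
          + "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
          + "    <p><strong>foo</strong></p>\n    <p>bar</p>\n</div>\n"
          + "<div class=\"ui-c\">\n    <p><strong>bla</strong></p>\n    <p>...</p>\n</div>\n</div>\n\n"
          + "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
          + "    <p><strong>foo1</strong></p>\n    <p>bar1</p>\n</div>\n"
          + "<div class=\"ui-c\">\n    <p><strong>bla1</strong></p>\n    <p>...</p>\n</div>";

    // Returns one [group1, group2, group3] row per match.
    static List<List<String>> extract(String regex, String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            rows.add(Arrays.asList(m.group(1), m.group(2), m.group(3)));
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(extract(WITH_TRAILING, SAMPLE));
        System.out.println(extract(WITHOUT_TRAILING, SAMPLE));
        // both print [[foo, bar, bla], [foo1, bar1, bla1]]
    }
}
```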
Eric Siebeneich
  • Thanks for your advice. I tried all of your versions. #3 is the fastest one, but I thought there was more room for optimization; your implementation is less than 10% faster. – CannyDuck Nov 10 '11 at 21:49
0

Your regex has both a leading and a trailing .*? and I don't understand why. And if the data is well-formatted, you really just mean a certain amount of whitespace, yes? Why wouldn't it be:

// \s has to be written as \\s inside a Java string literal
Pattern p = Pattern.compile("<div class=\"ui-b\">\\s*<p><strong>([^<]*)</strong></p>\\s*<p>([^<]*)</p>\\s*</div>\\s*<div class=\"ui-c\">\\s*<p><strong>([^<]*)</strong></p>");
Matcher m = p.matcher(responseText);

while (m.find()) {
   String foo = m.group(1);
   String bar = m.group(2);
   String bla = m.group(3);

   /* do whatever w/ foo, bar, bla */
}

Here I've dropped the leading and trailing .*? and replaced the inner ones with whitespace matches (or is there more in between that you're leaving out of the example, perhaps?). But regardless, why would you need the beginning and end .*?

If the data is well-formatted, having the regex match only whitespace between the tags should speed it up substantially.
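
A self-contained version of the snippet above, run against the sample response from the question, might look like the following sketch. The class name WhitespaceOnlyDemo and its members are made up for this illustration, and it only works if nothing but whitespace separates the tags, as in the sample.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: whitespace-only pattern, compiled once and reused.
class WhitespaceOnlyDemo {
    static final Pattern FIELDS = Pattern.compile(
            "<div class=\"ui-b\">\\s*<p><strong>([^<]*)</strong></p>\\s*<p>([^<]*)</p>\\s*</div>"
          + "\\s*<div class=\"ui-c\">\\s*<p><strong>([^<]*)</strong></p>");

    // The sample response from the question.
    static final String SAMPLE =
            "...\n"
          + "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
          + "    <p><strong>foo</strong></p>\n    <p>bar</p>\n</div>\n"
          + "<div class=\"ui-c\">\n    <p><strong>bla</strong></p>\n    <p>...</p>\n</div>\n</div>\n\n"
          + "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
          + "    <p><strong>foo1</strong></p>\n    <p>bar1</p>\n</div>\n"
          + "<div class=\"ui-c\">\n    <p><strong>bla1</strong></p>\n    <p>...</p>\n</div>";

    // Returns one [foo, bar, bla] row per ui-b/ui-c pair found in the input.
    static List<List<String>> extract(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher m = FIELDS.matcher(html);
        while (m.find()) {
            rows.add(Arrays.asList(m.group(1), m.group(2), m.group(3)));
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE));
        // prints [[foo, bar, bla], [foo1, bar1, bla1]]
    }
}
```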

Mike Ryan