
I want to test the performance of two different approaches, in Perl, to checking that one string is contained entirely within another.

The first approach is to take a string, convert it to an array, and test character by character, whilst the second approach simply evaluates a regular expression. I believe the regex match has the same order as a linear search through all the characters, but doesn't incur the cost of allocating memory for an array and copying characters into it (though it might have other costs involved).

My initial approach to this test was to put both procedures (see below) in a big for loop (0 to 999999) and time how long the program takes to finish. At first it looked as though the regex match was much faster (0.318s vs 12.926s for the character-by-character check), but I then considered the possibility that once the regex has been evaluated, the following iterations are trivial because the compiled regex is cached. To test this I instead put the for loop on the command line, making each run of the Perl script (which loops from 0 to 0) "memory-less", and noticed that the two are similar (albeit with some wild divergence from the average at times). But I strongly suspect this is a poor conclusion, because the time taken to start the script probably dominates its execution time.

Is there a way (especially for when I want to look at something less trivial) of turning off the caching (if that's what is happening, of course), so that I can fairly run both procedures within a for loop and call the script only once?

Or is it the case that there is nothing clever going on, and a regex search really is much quicker in this example!?

use strict;
use warnings;

my $teststr = "testing testing";
my $teststr2 = "testing tasted";
my $match = 1;

#procedure one - character by character checking
for (my $i = 0; $i < 1; $i++)
{
    my @chrArr = split //, $teststr;
    my @chrArr2 = split //, $teststr2;
    for (my $j = 0; $j < @chrArr2; $j++)
    {
        if($chrArr[$j] ne $chrArr2[$j])    # 'ne', not '!=': these are string comparisons
        {
            $match = 0;
            last;    # Perl's loop exit is 'last', not C's 'break'
        }
    }
}

#procedure 2 - regex matching
for (my $i = 0; $i < 1; $i++)
{
    if($teststr !~ m/$teststr2/)
    {
        $match = 0;
    }
}
HexedAgain
  • You should take a look at the [`Benchmark`](https://metacpan.org/module/Benchmark) module. – Borodin Dec 14 '13 at 22:26
  • That looks useful - thanks, I shall have a play – HexedAgain Dec 14 '13 at 22:46
  • Play finished for now (nice tool!)... the timings of the two approaches (n=10^7) are 5.11s for char-by-char (including array copy) and 2.99s for regex. This is more in line with what I expected (the factor of forty in my first test looked too good). – HexedAgain Dec 14 '13 at 23:44

2 Answers


Why don't you use the Benchmark module? It should fit perfectly here.

use Benchmark qw(timethese cmpthese);
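
For instance, a minimal sketch comparing the question's two procedures (the labels and the -3 count, which means "run each sub for at least 3 CPU seconds", are arbitrary choices):

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $teststr  = "testing testing";
my $teststr2 = "testing tasted";

# cmpthese runs each sub repeatedly and prints a table of rates,
# so there is no need for a hand-rolled timing loop.
cmpthese(-3, {
    char_by_char => sub {
        my @chrArr  = split //, $teststr;
        my @chrArr2 = split //, $teststr2;
        for my $j (0 .. $#chrArr2) {
            last if $chrArr[$j] ne $chrArr2[$j];
        }
    },
    regex => sub {
        my $found = $teststr =~ m/$teststr2/;
    },
});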

-- cmic


Regular expression matching/searching is linear. Compiling the pattern is expensive. If you change `$teststr2` on every iteration, no caching will be possible. For example:

#procedure 2 - regex matching
for (my $i = 0; $i < 1; $i++)
{
    # ${i} changes on every iteration, so a different pattern must be
    # compiled each time through the loop
    if($teststr !~ m/${i}$teststr2/)
    {
        $match = 0;
    }
}
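
Conversely, if the pattern really is fixed, you can pay the compilation cost once yourself by precompiling with qr// (a minimal sketch; the \Q...\E is an extra precaution in case the search string contains regex metacharacters):

use strict;
use warnings;

my $teststr = "testing testing";
my $teststr2 = "testing tasted";
my $match = 1;

# compile the pattern once, outside the loop, and reuse it
my $re = qr/\Q$teststr2\E/;

for (my $i = 0; $i < 1000000; $i++)
{
    if($teststr !~ $re)
    {
        $match = 0;
    }
}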
krico
  • That's great, thanks for that one. For what it's worth, in this example 1000000 iterations of the regex evaluation take 2.519s (looks more plausible) compared to 12.926s for a character-by-character check (for something more complicated, where multiple searches dominate the array assignment in the first approach, I might see something different). – HexedAgain Dec 14 '13 at 21:27
  • Actually, to make the tests fairer this time I made the char-by-char search only look at the first character in the inner for loop (since I expect the regex will always fail quickly for a null variable). This time the first approach took 6.392s (the array assignment will dominate here; indeed, pulling the array setup out of the loop, the first approach's execution time was 0.598s). – HexedAgain Dec 14 '13 at 21:36
  • @HexedAgain: You appear to be looking for ways to handicap regular expression matching in an attempt to make it as slow as looping over individual characters. In krico's code, what you are timing includes converting the number `$i` to a string, concatenating two strings, and compiling the regex. Do you really expect to test against a million *different* patterns so that the regex is different each time? Regexes are faster, largely because that is one of the points of using them. The factor of forty that you got in your original test seems about right to me. What else are you trying to prove? – Borodin Dec 14 '13 at 22:24
  • @krico: Yes, compiling regexes is usually much more time-consuming than the pattern match itself, but what are you hoping to prove here? It is unrealistic to expect every pattern to be different, which is what makes the one-off overhead of compilation worthwhile. If you are adding arbitrary additional work just to slow down the loop then you may as well use `sleep`. – Borodin Dec 14 '13 at 22:30
  • @krico: Also, regex matching is *not* linear. Try `/a\z/` against any length of string and I bet you get about the same time (see the sketch after these comments). – Borodin Dec 14 '13 at 22:32
  • @Borodin: thanks for that - I overlooked that point. I'm not so much trying to handicap regular expression matching as trying to determine whether the simplicity of using them might be undermined by their cost (because where I intend to use them they will be used often enough that cutting cost is important). Given that I do not know how the regex engine works, it seemed plausible that the time to evaluate a simple match between two strings should be proportional to the time it takes to actually walk along the string - that you suggest this is not the case is very surprising. – HexedAgain Dec 14 '13 at 22:40
  • ... to qualify "used often" I envisage regex searching for strings in a worst-case scenario in a large list (a blind regex btw), and then augmenting that list wrt user input and playing the same game again, and so on... – HexedAgain Dec 14 '13 at 22:49
  • @HexedAgain: Regular expressions are a *compiled language* that is *designed and optimised for matching patterns*. On the other hand, Perl is an *interpreted scripting language* that is *optimised for general-purpose use*. You stand a chance that a C program may be faster than a regular expression for a specific case, but Perl will never come close. That is why Perl has regular expressions built in! And as soon as you have user input in the loop then *nothing else* is slower than the fastest user. – Borodin Dec 14 '13 at 22:58
  • @Borodin: To give some further context to this one, in one of my previous threads (http://stackoverflow.com/questions/20361498/perl-user-input-dynamic-auto-tab-expansion) I expressed my desire to code up a fast change directory script (and I'm thinking about doing it in C++) where user-input should (if used correctly) come very fast. Some of the work in developing that script was contained in an answer, though fortunately enough is left for me to work on. At any rate, in the case I happen to be in a large directory I don't want any perceptible waiting time - hence my concerns. – HexedAgain Dec 14 '13 at 23:12
  • But in a large directory, surely I/O wait will dominate over any (reasonable) CPU usage? – tripleee Dec 15 '13 at 18:09
  • @tripleee: I will only do the costly I/O once per cd into a new directory - the results will be passed into an array, upon which the remaining work comprises finding subsets which contain the current input and their left intersections wrt char entry (to auto-expand). I'll also support not just directory navigation (case-insensitive unless forced) but also listing files (if input is ';' or no dirs are left - there could be many hundreds, similarly named) and opening them (editor dependent on file type). There are other features I intend to add also. The non-I/O processing here will not be trivial. – HexedAgain Dec 15 '13 at 19:33
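
For what it's worth, Borodin's `/a\z/` point above is easy to check with the Benchmark module; a minimal sketch (the string lengths here are arbitrary):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# the same end-anchored pattern against a short and a very long string
my $short = ('b' x 10) . 'a';
my $long  = ('b' x 1_000_000) . 'a';

cmpthese(-3, {
    short => sub { $short =~ /a\z/ },
    long  => sub { $long  =~ /a\z/ },
});

If the two rates come out comparable, the match cost is not proportional to the string length for this pattern.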