Does sorting help efficiency of grep in Perl

Question

I'm looking for some specifics about how Perl's grep function works. I'm doing this:

if ( grep{ $foo == $_ } @bar ) {
  some code;
}

Suppose @bar is large (hundreds of thousands of elements). With my data, if I sort @bar, values of $foo are more likely to appear near the beginning of the array than near the end. I'm wondering if this will help performance.

Put differently, with the above code, does grep move sequentially through @bar checking whether $foo == $_ and then immediately exit once any value has been found to be true? Or would it actually check every element of @bar before returning a value?

good question. I think, that you need to use not `grep`, but `for ()` for your early exit — gaussblurinc, Mar 19 '13 at 18:20
And see my comment below. `List::MoreUtils` on CPAN can do what you want, if I right understand your task — gaussblurinc, Mar 19 '13 at 18:25
@loldop You should put this as an answer. Seems like `firstidx` would do what I want. — itzy, Mar 19 '13 at 18:33
@user1937198: no, in the demonstrated usage, it returns a *count* of elements where the condition was true, and it could be optimized to stop early in boolean context if there were not the possibility of side effects — ysth, Mar 19 '13 at 18:33
note that firstidx will be *much slower* than grep; only use it if grep is taking too long and you can benchmark an improvement — ysth, Mar 19 '13 at 18:34
@itzy Will you be performing more tests? If so, creating a hash might be a good idea. — TLP, Mar 19 '13 at 18:39

ikegami · Accepted Answer · 2013-03-20T01:47:58.123

10

grep does not short-circuit, so ordering of elements does not matter.

While List::MoreUtils's first does short-circuit, the whole list must be placed on the stack before it's called.

This will be best:

for (@bar) {
   if ($foo == $_) {
      some code;
      last;
   }
}

Updated: I originally iterated over the indexes since that uses O(1) memory, but so does for (@bar) (as opposed to for (LIST) in general) as ysth reminded me.

edited Mar 20 '13 at 01:47

answered Mar 19 '13 at 18:37

ikegami

367,544
15
269
518

1

Are you sure using indexes will be faster than using the elements directly? – TLP Mar 19 '13 at 18:48
@TLP, Won't put the array on the stack. Memory wise, yes. Speed, maybe a itsy tiny bit slower from the singular extra op. – ikegami Mar 19 '13 at 18:49
2

for on an array won't put the array on the stack – ysth Mar 19 '13 at 19:08
`first` is very likely to be faster even with the stack copy. Memory could be an issue, but "if it fits in memory once, it will fit twice" is a good enough first approximation. – hobbs Mar 19 '13 at 21:14
@hobbs, Why would doing more be faster?! It does the same thing plus a whole bunch of function calls. – ikegami Mar 20 '13 at 01:18
@ikegami : I have not heard of arrays being moved to stack during some operations. Can you point me to some link which will help in knowing more abt this? – Guru Mar 20 '13 at 06:02
@Guru, Arrays evaluate to a list of their elements in scalar context. Surely that's in perldata? `for (@a)` is special-cased to pass a reference to `@a` instead of creating a list as documented. This optimisation is not documented. – ikegami Mar 20 '13 at 06:16
`$comment[$this-1] =~ s/scalar context/list context/;` – ikegami Mar 20 '13 at 07:59

score 6 · Answer 2 · answered Mar 19 '13 at 18:20

6

Since your usage of grep is in scalar context, it is returning the number of matching elements. To calculate this, Perl has to visit each element anyway, so it's unlikely that sorting would help performance from this angle.

answered Mar 19 '13 at 18:20

sidyll

57,726
14
108
151

Maybe [List::MoreUtils](https://metacpan.org/module/List::MoreUtils) can help in this situation? Not sure – gaussblurinc Mar 19 '13 at 18:23
@loldop I think so! I'm not particularly familiar with this module, but it seems to be very well constructed and List::MoreUtils::any would be useful here. – sidyll Mar 19 '13 at 18:28
1

`grep` checks all of the list elements *regardless* of context. It simply returns a list in list context and a count in scalar context. `first` and similar are a bad choice here as the entire list is copied to the parameters of the function, which is a big deal for hundreds of thousands of scalars. The best solution is a simple `for` loop with a `last` once the condition has been satisfied. A loop like that will terminate sooner if the match is near the beginning of the list, and if it can be detected that the loop has gone beyond any potential match. – Borodin Mar 19 '13 at 19:46

score 2 · Answer 3 · answered Mar 19 '13 at 20:29

In your example grep will iterate whole array regardless how many elements matched.

If you are able to keep this array sorted - its faster to search for your values using binary search. Also you can convert your array into hash (with keys = array element) and do your checks with constant time, but this will eat additional memory.

Borodin · Answer 4 · 2013-03-20T12:03:45.517

1

With regard to your question

With my data, if I sort @bar, values of $foo are more likely to appear near the beginning of the array than near the end. I'm wondering if this will help performance.

If the list is sorted in numerical order then you can write

sub contains {
  my ($list, $item) = @_;
  for (@$item) {
    return $_ == $item if $_ >= $item;
  }
  return !1;
}

some_code() if contains(\@bar, $foo);

edited Mar 20 '13 at 12:03

answered Mar 19 '13 at 19:52

Borodin

126,100
9
70
144

1

If the list is sorted you can use bsearch and find result way faster. Also, `return;` returns empty list or undef, while `return $_ == $item` returns 1 or "", so this sub has 4 different returning values, thats bad. – Galimov Albert Mar 19 '13 at 20:24
If the list is sorted, you can use a binary search. – ikegami Mar 20 '13 at 06:26
@ikegami: Yes, that is true too of course. PSIAlt also felt he had to point this out. The OP asked whether sorting the data would improve the performance of `grep`, and my answer is *“No, but it would help for a hand-coded equivalent like this (and many others)”* – Borodin Mar 20 '13 at 12:01
And I'm saying your answer is: "You're having performance issues, so why don't you use something that might not be faster at all when there exists something much much faster." – ikegami Mar 20 '13 at 16:48
@ikegami: No, you are very wrong again. I am more than happy with my answer as it is. – Borodin Mar 20 '13 at 17:04

score 0 · Answer 5 · edited May 23 '17 at 10:32

It depends. A grep { $x == $_ } @a does not benefit from branch prediction, but grep { $x < $_ } @a does!

#!/usr/bin/env perl

use strict;
use warnings;

use Time::HiRes qw( clock_gettime CLOCK_MONOTONIC );

use constant MIN => 0;
use constant MAX => 1000;
use constant AVG => int(MIN  + (MAX - MIN) / 2);
use constant N_LOOPS => 40000;
use constant ARRAY_LEN => 10000;

## is grep faster for sorted arrays?

##
## RANDOM ARRAY VALUES
##
my $n = 0;
my @a = map { int(rand() * (MAX - MIN) + MIN) } 1 .. ARRAY_LEN;
my $duration = -clock_gettime ( CLOCK_MONOTONIC );
for(my $i = 0; $i < N_LOOPS; $i++) {
    $n += grep { AVG < $_ } @a;
}
$duration += clock_gettime ( CLOCK_MONOTONIC );
print "duration: $duration secs, n = $n".$/;

##
## PREDICTABLE ARRAY VALUES
##
$n = 0;
@a = sort {$a <=> $b} @a;
$duration = -clock_gettime ( CLOCK_MONOTONIC );
for(my $i = 0; $i < N_LOOPS; $i++) {
    $n += grep { AVG < $_ } @a;
}
$duration += clock_gettime ( CLOCK_MONOTONIC );
print "duration: $duration secs, n = $n".$/;

## and now we try to eliminate side effects by repeating

##
## RANDOM ARRAY VALUES
##
$n = 0;
@a = map { int(rand() * (MAX - MIN) + MIN) } 1 .. ARRAY_LEN;
$duration = -clock_gettime ( CLOCK_MONOTONIC );
for(my $i = 0; $i < N_LOOPS; $i++) {
    $n += grep { AVG < $_ } @a;
}   
$duration += clock_gettime ( CLOCK_MONOTONIC );
print "duration: $duration secs, n = $n".$/; 

##
## PREDICTABLE ARRAY VALUES
##
$n = 0;
@a = sort {$a <=> $b} @a;
$duration = -clock_gettime ( CLOCK_MONOTONIC );
for(my $i = 0; $i < N_LOOPS; $i++) {
    $n += grep { AVG < $_ } @a;
}   
$duration += clock_gettime ( CLOCK_MONOTONIC );
print "duration: $duration secs, n = $n".$/;

The results:

duration: 27.7465513650095 secs, n = 199880000 <-- unsorted
duration: 26.129752348992 secs, n = 199880000  <-- sorted
duration: 28.3962040760089 secs, n = 202920000 <-- unsorted
duration: 26.082420132996 secs, n = 202920000  <-- sorted

See also Why is it faster to process a sorted array than an unsorted array?

As ikegami points out, `grep` will walk over every item in the list. It is independent of the conditional. — Zaid, Mar 20 '13 at 08:29
Nope. The first argument to grep decides which side of the branch *inside grep* will be taken: add the element to the result list or not. — user1050755, Mar 20 '13 at 08:47

Does sorting help efficiency of grep in Perl

5 Answers5