-2

This question is very similar to: this post, but I couldn't find the answer there.

# 0m2.676s
#
# note that it should be == instead of in
# but as far as timing goes, it would be even faster
# (thanks to @Booboo for pointing this)
#
# here: ------------++
#                   ||
if any("xmuijdswly" in w for w in data):
    print("FOUND IT")

is much faster than:

# 0m13.476s
for d in data:
    if "xmuijdswly" == d:
        print("FOUND IT")
        break

my data contains 10^7 arbitrary strings of average length 30

EDIT: the time difference exists in both linux and windows:

PS ... > Measure-Command { python test_any.py }
TotalSeconds      : 4.5402383

PS ...> Measure-Command { python test_for.py }
TotalSeconds      : 17.7107506

EDIT: generating random strings program (for completeness)

$ cat main.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *fl=fopen("input.txt", "w+t");
    for (int i=0;i<10000000;i++)
    {
        int length = 5 + rand() % 50;
        for (int j=0;j<length;j++)
            fprintf(fl, "%c", rand() % 26 + 'a');
        fprintf(fl, "\n");
    }
    fclose(fl);
    return 0;
}
OrenIshShalom
  • 5,974
  • 9
  • 37
  • 87

2 Answers2

4

First, your two benchmarks are performing different tests. The first one, with the for loop, is matching a string against a list of strings by testing for equality. The second one, is using the in operator testing whether one string is a substring of another and would be a more expensive operation than testing for equality. But even if we change the second benchmark to instead test for equality, which should make it run even faster, I find the results to be the opposite of what you found, i.e. using any runs more slowly than the for loop. In the following demo I have arranged it that both benchmarks must iterate through all 10_000_000 strings because a match will never be found:

import time

data = ['-' * 30 for _ in range(10_000_000)]

def test1():
    for d in data:
        if 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == d:
            return True
    return False

def test2():
    return any('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == w for w in data)

t = time.time()
b = test1()
print(b, time.time() - t)

t = time.time()
b = test2()
print(b, time.time() - t)

Prints:

False 0.2701404094696045
False 0.5590765476226807

The second version using any runs twice as slowly (perhaps you should re-check your results). Why does the version using any actually run more slowly?

The argument to any is a generator expression, which results in creating a generator function. So to test each value of the data a function call to generate the next comparison result must be made and that function call is what is making things run more slowly. We can turn your any version into one that uses an explicit generator function:

import time

data = ['-' * 30 for _ in range(10_000_000)]

def test1():
    for d in data:
        if 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == d:
            return True
    return False

def generator():
    for w in data:
        yield 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == w

def test2():
    return any(generator())

t = time.time()
b = test1()
print(b, time.time() - t)

t = time.time()
b = test2()
print(b, time.time() - t)

Prints:

False 0.28400230407714844
False 0.5430142879486084

As you can see, the time required for both any versions is essentially the same, as I would expect.

Update: Using the OP's C Program to Generate the Data

Here I used the OP's C program to generate the data and I have used the same "xmuijdswly" string for the comparison:

import time

total_length = 0
data = []
with open('input.txt') as f:
    for line in f:
        line = line.strip()
        total_length += len(line)
        data.append(line)
print('Average string length:', total_length / len(data))

def test1():
    for d in data:
        if 'xmuijdswly' == d:
            return True
    return False

def test2():
    return any('xmuijdswly' == w for w in data)

t = time.time()
b = test1()
print(b,  'Time using for loop:', time.time() - t)

t = time.time()
b = test2()
print(b, 'Time using any:', time.time() - t)

Prints:

Average string length: 29.4972984
False Time using for loop: 0.3110032081604004
False Time using any: 0.6610157489776611

One Possible Explanation for OP's Results

The OP hasn't actually posted the complete programs that they were testing against, which presumably have to first read in the data from input.txt. If the any version was run second, the data in input.txt would have been cached by the operating system as a result of the looping version having read the file and therefore the I/O time to read the input data would be much less for the any version.

Was the time to do that initial creation of the data list erroneously included in the benchmark timings?

Booboo
  • 38,656
  • 3
  • 37
  • 60
  • Oh I completely missed that typo, thanks, I fixed it and measured times again. As expected, the time differences got bigger, but they are still in favor of `any` ... very strange indeed. I double checked on both linux and windows and the results are the same. – OrenIshShalom Jun 14 '22 at 05:25
  • 1
    I have updated the answer again with a possible explanation for our different results. – Booboo Jun 14 '22 at 10:48
  • Thanks for the detailed answer and effort ! you were right about the preliminary reading phase - it was *not* identical in the `for` and `any` ... when I used the same mechanism I get similar results to yours. – OrenIshShalom Jun 14 '22 at 12:14
0

any with generator expressions performs a lazy evaluation - as soon as you find first match it stops iterating. Meanwhile your if loop has to go through all the elements so it takes more time. If you add a break after the print they should take roughly similiar amounts of time.

matszwecja
  • 6,357
  • 2
  • 10
  • 17
  • I actually chose a word that is *not* in the data so this is not the reason :( ... I'll edit my question to make things clearer – OrenIshShalom Jun 09 '22 at 10:54
  • 1
    The OP edited the question and added a break. But with the break the version that uses the for loop actually runs twice as fast as the version that uses `any` because the `any` version converts the generator expression into a generator function and therefore a function call must be made to get the next comparison value to test. You should have tried running this yourself. – Booboo Jun 12 '22 at 12:16