2

recently i saw a question here on python here

1: Python If statement and logical operator issue. There was someone in comments who gave an answer that it can be done like this:

1 in (1, 2, 3) to check if 1 is in present in a collection of items.but according to me this should be much faster 1 in {1, 2, 3}. as you can see there in discussions ,someone of high reputation goes on saying that ( ) is faster for fixed size input. and has faster lookup than { }. i am asking it here because i want to know for my own understanding which one is correct and also i do not get the idea of ( ) being fiexd-size or variable size. i just asked for a refrence in that original question so i can correct myself if i am wrong but the user insits upon clearing my basics of computer science knowledge without givin g a single refrence to his argument that lookup in Tuple is O(1).so i am asking it here.

anekix
  • 2,393
  • 2
  • 30
  • 57
  • He's saying that there is no `n`; you know exactly what tuple you're searching, and it's small. It's not explained very well, and some of his other points are wrong, but he has a point. – user2357112 Jul 10 '17 at 17:03
  • "Fixed-size" means the size cannot change; "variable size" means it can. – Scott Hunter Jul 10 '17 at 17:04
  • @ScottHunter `tuple` is fixed always once allocated the same tuple cant be changed .i meant size in that sense. there the example was 2 element tuple but i was talking about any arbitrary size. or just lookup in `tuple`. – anekix Jul 10 '17 at 17:06
  • First of all, what execution figures did you get when you tested your hypothesis with large structures? – Prune Jul 10 '17 at 17:10
  • 1
    @anekix tuples have to be sequentially scanned to complete a membership test... if it's tiny and your values appear early, then a loop and equality check will be able to outperform a `set` which has the overhead of hashing and then equality checking. As soon as you get past more than a completely non-trivial number of elements though, it's much more efficient. – Jon Clements Jul 10 '17 at 17:11
  • @JonClements thanks. i was just talking about general lookup of `set` vs `tuple` when considering any arbitrary `n`.my bad – anekix Jul 10 '17 at 17:17
  • 4
    This article possibly has useful/related stats: https://technobeans.com/2012/04/09/performance-for-testing-memberships-list-vs-tuples-vs-sets/ : `Conclusions` : `1. for testing membership of elements at the beginning of the collection, lists or tuples perform more than 100% better than sets`, `2. As soon as you start testing membership of elements that are in the middle or end of the collection, sets perform 40% – 1800% better than lists or tuples` - unfortunately it's over 5 years old and python v.2.7.2, so may be very dated info – chickity china chinese chicken Jul 10 '17 at 17:17

3 Answers3

6

When you say something like O(n), you have to say what n is. Here, n is the length of the tuple... but the tuple is not an input. You're not taking the tuple as an argument or anything. n is always 2 in the conversation you linked, or 3 for your example tuple, so for this particular n, O(n) is the same thing as O(2), or O(1).

As you may have noticed by now, it doesn't make much sense to talk about O(n) when n is a constant. If you had a function like

def in_(element, tup):
    return element in tup

you could say that the runtime is O(n) element comparisons, where n is len(tup), but for something like

usr in ('Y', 'y')

talking about n isn't very useful.

user2357112
  • 260,549
  • 28
  • 431
  • 505
  • This is perfectly valid when it has always two elements or fixed elemet. but i was generalizing i didn't notice that.i was just referring to simple lookup complexity of `Set` vs `Tuple`. – anekix Jul 10 '17 at 17:14
  • also i was generalizing because he made a point `Array lookup is fast`.but as explained by @downshift i get the point now – anekix Jul 10 '17 at 17:28
3

Although the other commenter is technically correct that x in <constant expression> is O(1) at run-time, it is still interesting to compare performance of sets and tuples of the same size. The discussion can be generalized it to consider different programs containing expressions of the form x in {a, b, c, ...}. If the expression consists of n items, then n can be considered input for big-O analysis, among all possible such programs. (And if still one insisted that it be possible to provide a different n at run-time, simply imagine that the function is created using exec.)

The issue with performance of such expressions is that the Python run-time must construct a one-time set for use in the in test, and then immediately discard it. This is clearly visible in the generated assembly:

>>> import dis
>>> def is_valid(x):
...     return x in {1, 2, 3}
... 
>>> dis.dis(is_valid)
  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (1)
              6 LOAD_CONST               2 (2)
              9 LOAD_CONST               3 (3)
             12 BUILD_SET                3
             15 COMPARE_OP               6 (in)
             18 RETURN_VALUE        

Constructing a set of n elements obviously has a cost of at least O(n). In other words, the test using a set, implemented as a literal constant, is not O(1) because the interpreter has to construct the set. This is what the commenter was trying to get across by referring to the cost of construction.

In fact, it gets stranger than that; due to nature of the Python VM, the compiler is allowed to construct tuples consisting only of numeric literals at compile time, which it does:

>>> import dis
>>> def is_valid(x):
...     return x in (1, 2, 3)
... 
>>> dis.dis(is_valid)
  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               4 ((1, 2, 3))
              6 COMPARE_OP               6 (in)
              9 RETURN_VALUE        

Note how the (1, 2, 3) constant doesn't need to be built item by item - that is because it was already built by the compiler and inserted into the function's environment. As a result, this implementation of is_valid might actually be faster than the one using a set! This is easy to test:

$ python -m timeit -s 'def is_valid(x): return x in {1, 2, 3}' 'is_valid(-1)'
10000000 loops, best of 3: 0.189 usec per loop
$ python -m timeit -s 'def is_valid(x): return x in (1, 2, 3)' 'is_valid(-1)'
10000000 loops, best of 3: 0.128 usec per loop

Again, the other commenter was right.

Increasing the size of the set/tuple doesn't tip the balance in favor of set - it is always more expensive to construct a set of n items and then perform a blindingly-fast constant-time search, than to just iterate over a pre-created tuple looking for an item. This is because set creation has to allocate the set (possibly several times) and calculate hashes of all the items. Although both tuple search and set size are O(n), the set's one has a larger constant factor.

The correct way to implement an O(1) lookup would entail manually implementing the optimization that the compiler does automatically for tuples:

_valid = {1, 2, 3}
def is_valid(x):
    return x in _valid

Comparing this code with the equivalent code using a tuple, the set is always faster, even with a small number of items. As the number of items grows, set becomes the clear winner with its O(1) lookup.

user4815162342
  • 141,790
  • 18
  • 296
  • 355
  • In recent versions, constant set literals are built as a constant as well. At least since Python 3.6 (possibly earlier) the original ``is_valid`` uses the instruction ``LOAD_CONST 0 (frozenset({1, 2, 3}))``. – MisterMiyagi Feb 19 '20 at 17:17
3

The case for set has been improved in python 3.5 (perhaps earlier). Here is my test function:

def is_valid(x):
  return x in ('name', 'known_as')

The code generated is:

  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               3 (('name', 'known_as'))
              6 COMPARE_OP               6 (in)
              9 RETURN_VALUE

Changing the tuple to a set generates this code:

  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               3 (frozenset({'name', 'known_as'}))
              6 COMPARE_OP               6 (in)
              9 RETURN_VALUE

Python 3.7 and 3.8 generate the same code for both cases. The timing results are:

$ python3.5 -m timeit -s "def is_valid(x): return x in {'name', 'known_as'}" "is_valid('')"
10000000 loops, best of 3: 0.0815 usec per loop
$ python3.5 -m timeit -s "def is_valid(x): return x in ('name', 'known_as')" "is_valid('')"
10000000 loops, best of 3: 0.0997 usec per loop

Passing 'name' to is_valid for the tuple case runs in 0.0921 usec per loop, still slower than set.

Tom Ekberg
  • 2,133
  • 1
  • 13
  • 8