2

When using the `in` operator on a literal, is it most idiomatic for that literal to be a list, set, or tuple?

e.g.

for x in {'foo', 'bar', 'baz'}:
    doSomething(x)

...

if val in {1, 2, 3}:
    doSomethingElse(val)

I don't see any benefit to the list, but the tuple's immutability means it could be hoisted or reused by an efficient interpreter. And in the case of the `if`, if it's reused, there's an efficiency benefit.

Which is the most idiomatic, and which is the most performant in CPython?
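
A minimal sketch of how one might compare the three literal forms with `timeit` (assuming CPython 3.6+; the values and repeat count here are arbitrary, and the timings are only illustrative):

import timeit

# Arbitrary example values; absolute numbers vary by machine and Python version.
for literal in ("[1, 2, 3]", "(1, 2, 3)", "{1, 2, 3}"):
    elapsed = timeit.timeit("val in " + literal, setup="val = 3", number=1_000_000)
    print(literal, elapsed)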

John Kugelman
David Ehrmann
  • Set. Because of `O(1)` lookup unlike lists and tuples. – freakish Nov 09 '17 at 19:47
  • With a `for` loop, `in` is part of the syntax, not an operator. Even `for x in 'foo', 'bar', 'baz':` would work, as the "target" doesn't have to be a single iterable object; it can just be a comma-separated sequence of objects. Any literal is equally reusable, because nothing else has a reference to it. – chepner Nov 09 '17 at 19:49
  • @freakish Only if the interpreter is clever enough to reuse it. – David Ehrmann Nov 09 '17 at 19:50
  • @chepner Doesn't that just create a tuple of `('foo', 'bar', 'baz')`, which _is_ a single iterable? –  Nov 09 '17 at 19:50
  • @DavidEhrmann No, `O(1)` lookup is always. What makes you think that? Whether the interpreter reuses a structure or not is irrelevant because it is the same issue for **all** of them. – freakish Nov 09 '17 at 19:53
  • @Blurp I guess so. I was just looking at the grammar to see what, exactly, the distinction was, but it's not clear. In this case, though, it's an O(n) operation to iterate over the value, regardless of its type. – chepner Nov 09 '17 at 19:54
  • @freakish you don't understand. If you have to *build a set every time* you can't escape O(N). Whether or not you have to build a set depends on whether or not the compiler is able to optimize this. For certain literals, the Python compiler indeed can accomplish this. – juanpa.arrivillaga Nov 09 '17 at 19:57
  • @freakish If the interpreter recreates the object each time, whether the lookup is O(1) or O(n) is moot; recreating the object takes O(n). *If*... If it's a const (see the answer below), then you can see a benefit. – David Ehrmann Nov 09 '17 at 19:57
  • @juanpa.arrivillaga You can't escape `O(N)` when you build lists as well. I don't see the point. What exactly are we measuring here? The perfomance of `in` or whole script? – freakish Nov 09 '17 at 19:58
  • Yes, indeed, **you can**. The compiler optimizes it to a `tuple`, which it saves as a constant, and you don't actually build a list; rather, you simply load that pre-built `tuple`. – juanpa.arrivillaga Nov 09 '17 at 20:00
  • @juanpa.arrivillaga And what do you think is the complexity of loading a pre-built tuple? – freakish Nov 09 '17 at 20:00
  • @freakish O(1), and not only that, it is **very fast** since it is a primitive array access. – juanpa.arrivillaga Nov 09 '17 at 20:01
  • @juanpa.arrivillaga Right, loading a tuple of length `N` is `O(1)`. :D I mean the interpreter reads the file in constant time :D – freakish Nov 09 '17 at 20:01
  • @freakish yes... of course. Do you think every time I access a `tuple`, let's say `my_tup`, the time of that depends on the *length* of the tuple? – juanpa.arrivillaga Nov 09 '17 at 20:02
  • @juanpa.arrivillaga No, dude, loading any structure of size `N` to memory is at least `O(N)`. – freakish Nov 09 '17 at 20:03
  • @freakish so then all that work on the peep-hole optimizer was for fun? Caching doesn't work? The point is, **they are already loaded into memory**. Yes, of course, building the structure **once** is O(N), but then you re-use that structure. – juanpa.arrivillaga Nov 09 '17 at 20:05
  • @juanpa.arrivillaga How is the structure loaded to memory before you run the script? I absolutely have no idea what caching has to do with anything. Again: what is being measured here anyway? – freakish Nov 09 '17 at 20:08
  • @freakish take, for example, the two functions in my answer. Naively, one might suppose that the literals will be evaluated every time in the for-loop, thereby negating the O(1) membership check on a set with an O(N) build time. Indeed, this was the case in Python with set-literals before Python 3.2. However, in later versions, the compiler optimizes this, so you *can* rely on `O(1)` behavior for set literals in a function. – juanpa.arrivillaga Nov 09 '17 at 20:18
  • @juanpa.arrivillaga Yeah, your code is `O(1)` because the compiler did `O(N)` setup earlier to create the list/set. That's the whole point: what is being measured? Because if you measure the script from beginning to end then both are `O(N)`. One might say that one uses `N` "operations" and the second `2N` (due to the need for double set creation), but both are `O(N)`. And this can be clearly noticed when you measure huge constant arrays/sets, in which case the `O(1)` lookup starts to matter and the `O(N)` list/set loading matters. For a small list/set it doesn't matter at all. – freakish Nov 09 '17 at 20:20

2 Answers

7

Python provides a disassembler, so you can often just check the bytecode:

In [4]: def checktup():
   ...:     for _ in range(10):
   ...:         if val in (1, 2, 3):
   ...:             print("foo")
   ...:

In [5]: def checkset():
   ...:     for _ in range(10):
   ...:         if val in {1, 2, 3}:
   ...:             print("foo")
   ...:


In [6]: import dis

For the tuple literal:

In [7]: dis.dis(checktup)
  2           0 SETUP_LOOP              32 (to 34)
              2 LOAD_GLOBAL              0 (range)
              4 LOAD_CONST               1 (10)
              6 CALL_FUNCTION            1
              8 GET_ITER
        >>   10 FOR_ITER                20 (to 32)
             12 STORE_FAST               0 (_)

  3          14 LOAD_GLOBAL              1 (val)
             16 LOAD_CONST               6 ((1, 2, 3))
             18 COMPARE_OP               6 (in)
             20 POP_JUMP_IF_FALSE       10

  4          22 LOAD_GLOBAL              2 (print)
             24 LOAD_CONST               5 ('foo')
             26 CALL_FUNCTION            1
             28 POP_TOP
             30 JUMP_ABSOLUTE           10
        >>   32 POP_BLOCK
        >>   34 LOAD_CONST               0 (None)
             36 RETURN_VALUE

For the set-literal:

In [8]: dis.dis(checkset)
  2           0 SETUP_LOOP              32 (to 34)
              2 LOAD_GLOBAL              0 (range)
              4 LOAD_CONST               1 (10)
              6 CALL_FUNCTION            1
              8 GET_ITER
        >>   10 FOR_ITER                20 (to 32)
             12 STORE_FAST               0 (_)

  3          14 LOAD_GLOBAL              1 (val)
             16 LOAD_CONST               6 (frozenset({1, 2, 3}))
             18 COMPARE_OP               6 (in)
             20 POP_JUMP_IF_FALSE       10

  4          22 LOAD_GLOBAL              2 (print)
             24 LOAD_CONST               5 ('foo')
             26 CALL_FUNCTION            1
             28 POP_TOP
             30 JUMP_ABSOLUTE           10
        >>   32 POP_BLOCK
        >>   34 LOAD_CONST               0 (None)
             36 RETURN_VALUE

You'll notice that in both cases the function simply does a LOAD_CONST, i.e., both literals have been optimized into constants. Even better, in the case of the set literal, the compiler has saved a `frozenset`: while building the function, the peephole optimizer worked out that the literal contains only constants, so it can be replaced by its immutable equivalent.
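
A quick way to see this directly (a sketch, not from the original session) is to inspect the function's constants; the exact contents of `co_consts` may vary between CPython versions:

print(checkset.__code__.co_consts)
# e.g. (None, 10, 1, 2, 3, 'foo', frozenset({1, 2, 3}))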

Note that on Python 2, the compiler builds the set every time:

In [1]: import dis

In [2]: def checkset():
   ...:     for _ in range(10):
   ...:         if val in {1, 2, 3}:
   ...:             print("foo")
   ...:

In [3]: dis.dis(checkset)
  2           0 SETUP_LOOP              49 (to 52)
              3 LOAD_GLOBAL              0 (range)
              6 LOAD_CONST               1 (10)
              9 CALL_FUNCTION            1
             12 GET_ITER
        >>   13 FOR_ITER                35 (to 51)
             16 STORE_FAST               0 (_)

  3          19 LOAD_GLOBAL              1 (val)
             22 LOAD_CONST               2 (1)
             25 LOAD_CONST               3 (2)
             28 LOAD_CONST               4 (3)
             31 BUILD_SET                3
             34 COMPARE_OP               6 (in)
             37 POP_JUMP_IF_FALSE       13

  4          40 LOAD_CONST               5 ('foo')
             43 PRINT_ITEM
             44 PRINT_NEWLINE
             45 JUMP_ABSOLUTE           13
             48 JUMP_ABSOLUTE           13
        >>   51 POP_BLOCK
        >>   52 LOAD_CONST               0 (None)
             55 RETURN_VALUE
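
On Python 2 (or anywhere you'd rather not rely on the optimizer), a common workaround is to hoist the constant yourself so the set is built only once, at import time. A sketch, with `_CHOICES` as a placeholder name:

# Build the frozenset once at module level instead of on every iteration.
_CHOICES = frozenset([1, 2, 3])

def checkset():
    for _ in range(10):
        if val in _CHOICES:  # O(1) membership test against the prebuilt set
            print("foo")
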
juanpa.arrivillaga
0

IMO, there's essentially no such thing as "idiomatic" usage of literal values as shown in the question. Such values look like "magic numbers" to me. Using literals for "performance" is probably misguided because it sacrifices readability for marginal gains. In cases where performance really matters, using literals is unlikely to help much and there are better options regardless.

I think the idiomatic thing to do would be to store such values in a global or class variable, especially if you're using them in multiple places (but also even if you aren't). This provides some documentation as to what a value's purpose is and makes it easier to update. You can then memoize these values in function/method definitions to improve performance if necessary.

As to what type of data structure is most appropriate, that depends on what your program does and how it uses the data. For example, does ordering matter? With an `if x in y`, it won't, but maybe you're using the data in both a `for` loop and an `if`. Without context, it's hard to say what the best choice would be.

Here's an example I think is readable, extensible, and also efficient. Memoizing the global `ITEMS` in the function definitions makes lookup fast because `items` is in the local namespace of the function. If you look at the disassembled code, you'll see that `items` is looked up via `LOAD_FAST` instead of `LOAD_GLOBAL`. This approach also avoids making multiple copies of the list of items, which might be relevant if it's big enough (although, if it were big enough, you probably wouldn't try to inline it anyway). Personally, I wouldn't bother with these kinds of optimizations most of the time, but they can be useful in some cases.

# In real code, this would have a domain-specific name instead of the
# generic `ITEMS`.
ITEMS = {'a', 'b', 'c'}

def filter_in_items(values, items=ITEMS):
    matching_items = []
    for value in values:
        if value in items:
            matching_items.append(value)
    return matching_items

def filter_not_in_items(values, items=ITEMS):
    non_matching_items = []
    for value in values:
        if value not in items:
            non_matching_items.append(value)
    return non_matching_items

print(filter_in_items(('a', 'x')))      # -> ['a']
print(filter_not_in_items(('a', 'x')))  # -> ['x']

import dis
dis.dis(filter_in_items)
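
For comparison, here's a sketch (not part of the original answer) of the same filter written against the global directly; disassembling it should show the set being fetched with `LOAD_GLOBAL` inside the loop rather than `LOAD_FAST`:

def filter_in_items_global(values):
    # No default-argument binding: ITEMS is resolved at call time, so the
    # disassembly shows a LOAD_GLOBAL (ITEMS) on every loop iteration.
    matching_items = []
    for value in values:
        if value in ITEMS:
            matching_items.append(value)
    return matching_items

dis.dis(filter_in_items_global)
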
  • @juanpa.arrivillaga I strongly disagree. What's your reasoning? –  Nov 09 '17 at 20:02
  • Look at my answer. The compiler optimizes literals into immutable constants that it loads very, very quickly. A global variable would require a global lookup, much slower. – juanpa.arrivillaga Nov 09 '17 at 20:03
  • I'm not talking about performance, which is rarely a concern. I'm talking about "idiomatic" code that's easy to understand and change. –  Nov 09 '17 at 20:04
  • global **constants** are *ok* in compiled languages, but never **variables**. see [this](https://stackoverflow.com/a/19158418/3220135) for a quick overview – Aaron Nov 09 '17 at 20:05
  • Writing out the same literal multiple times in the same code is a bad practice, isn't it? The question mentions reuse. –  Nov 09 '17 at 20:05
  • @Blurp that's a fair point, but the question is *specifically* about literals: "When using the in operator on a literal, is it most idiomatic for that literal to be a list, set, or tuple" – juanpa.arrivillaga Nov 09 '17 at 20:05
  • There are globals all throughout the standard library. They're okay to use for some things. –  Nov 09 '17 at 20:08
  • @Blurp I don't disagree. Although, now I see that my original comment would suggest otherwise. For the record, I didn't downvote. – juanpa.arrivillaga Nov 09 '17 at 20:12