26

Lets say that I have a graph and want to see if b in N[a]. Which is the faster implementation and why?

a, b = range(2)
N = [set([b]), set([a,b])]

OR

N= [[b],[a,b]]

This is obviously oversimplified, but imagine that the graph becomes really dense.

locoboy
  • 38,002
  • 70
  • 184
  • 260

3 Answers3

43

Membership testing in a set is vastly faster, especially for large sets. That is because the set uses a hash function to map to a bucket. Since Python implementations automatically resize that hash table, the speed can be constant (O(1)) no matter the size of the set (assuming the hash function is sufficiently good).

In contrast, to evaluate whether an object is a member of a list, Python has to compare every single member for equality, i.e. the test is O(n).

phihag
  • 278,196
  • 72
  • 453
  • 469
  • 1
    Beat me to it. It's also worth pointing out that the OP has a list of lists in the first example, and a list of sets in the second. Because the outer container is a list in both situations, the search is O(n) in both situations. – g.d.d.c Oct 10 '11 at 18:32
  • 1
    Python sets are dynamically resized hashtables, so you get amortized O(1) regardless of the number of items. @g.d.d.c OP asks about membership tests for *a single item* (`b in N[a]`) of the list of containers that is `N`, so no it isn't O(n) either way. –  Oct 10 '11 at 18:36
  • @delnan You're right, `O(1)` (+notes) is a better description than `O(log n)` (-notes). – phihag Oct 10 '11 at 18:53
  • 1
    @phihag O(1) isn't just better than O(log n): O(1) is correct, and O(log n) is not. – Ned Batchelder Oct 10 '11 at 19:06
  • 1
    @NedBatchelder: Actually, if you look up the mathematical definition, you'll find that O(log N) is correct whenever O(1) is – if it's constant, it's also bounded by log N. So O(1) *is* “just” better than O(log N). – Petr Viktorin Oct 10 '11 at 19:33
  • @PetrViktorin: yes, that is part of the definition of big-O, so you are correct. O(log n) is not wrong, it is simply misleading and useless! :) Other correct descriptions of the membership testing of a set are O(N**2), O(N**N), O(N**N**N), etc, etc. – Ned Batchelder Oct 11 '11 at 18:05
  • @NedBatchelder You're right. My instinct was to quote O(log n) because that's what's achievable no matter the precise tuning (of bucket numbers and hash functions), but `O(1)` is the correct description here. Fixed. – phihag Oct 11 '11 at 18:15
6

It all depends on what you're trying to accomplish. Using your example verbatim, it's faster to use lists, as you don't have to go through the overhead of creating the sets:

import timeit

def use_sets(a, b):
    return [set([b]), set([a, b])]

def use_lists(a, b):
    return [[b], [a, b]]

t=timeit.Timer("use_sets(a, b)", """from __main__ import use_sets
a, b = range(2)""")
print "use_sets()", t.timeit(number=1000000)

t=timeit.Timer("use_lists(a, b)", """from __main__ import use_lists
a, b = range(2)""")
print "use_lists()", t.timeit(number=1000000)

Produces:

use_sets() 1.57522511482
use_lists() 0.783344984055

However, for reasons already mentioned here, you benefit from using sets when you are searching large sets. It's impossible to tell by your example where that inflection point is for you and whether or not you'll see the benefit.

I suggest you test it both ways and go with whatever is faster for your specific use-case.

Austin Marshall
  • 2,991
  • 16
  • 14
3

Set ( I mean a hash based set like HashSet) is much faster than List to lookup for a value. List has to go sequentially to find out if the value exists. HashSet can directly jump and locate the bucket and look up for a value almost in a constant time.

java_mouse
  • 2,069
  • 4
  • 21
  • 30