6

Possible Duplicate:
Python: Retrieve items from a set

Consider the following code:

>>> item1 = (1,)
>>> item2 = (2,)
>>> s = set([item1, item2])
>>> s
set([(2,), (1,)])
>>> new_item = (1,)
>>> new_item in s
True
>>> new_item == item1
True
>>> new_item is item1
False

So new_item is in s because it is equivalent to one of its items, but it is a different object.

What I want is to get item1 from s given new_item is in s.

One solution I have come up with is straightforward but not very efficient:

def get_item(s, new_item):
    for item in s:
        if item == new_item:
            return item

>>> get_item(s, new_item) is new_item
False
>>> get_item(s, new_item) is item1
True

Another solution seems more efficient but actually does not work:

 def get_item_using_intersection1(s, new_item):
     return set([new_item]).intersection(s).pop()

Nor this one:

 def get_item_using_intersection2(s, new_item):
     return s.intersection(set([new_item])).pop()

Because intersection works in an undefined way:

>>> get_item_using_intersection1(s, new_item) is new_item
True
>>> get_item_using_intersection1(s, new_item) is item1
False

>>> get_item_using_intersection2(s, new_item) is new_item
True
>>> get_item_using_intersection2(s, new_item) is item1
False

If this matters, I am using Python 2.7 x64 on Windows 7, but I need a cross-platform solution.


Thanks to everyone. I came up with the following temporary solution:

class SearchableSet(set):

    def find(self, item):
        for e in self:
            if e == item:
                return e

which will be replaced in future with the following solution (which is very incomplete right now):

class SearchableSet(object):

    def __init__(self, iterable=None):
        self.__data = {}
        if iterable is not None:
            for e in iterable:
                self.__data[e] = e

    def __iter__(self):
        return iter(self.__data)

    def __len__(self):
        return len(self.__data)

    def __sub__(self, other):
        return SearchableSet(set(self).__sub__(set(other)))

    def add(self, item):
        if not item in self:
            self.__data[item] = item

    def find(self, item):
        return self.__data.get(item)
Community
  • 1
  • 1
utapyngo
  • 6,946
  • 3
  • 44
  • 65
  • 1
    But... The "inefficient solution" you came up is already linear. – kennytm Apr 30 '12 at 12:49
  • I think he means *constant* time – Charles Salvia Apr 30 '12 at 12:50
  • 1
    Why not just use `new_item` if it's equivalent to `item1`? (As it should be in this case.) If the items are not, in fact, equivalent, this is a design problem: you shouldn't store these objects in sets as if they were. (It's really hard to tell from your generic description of the problem.) – millimoose Apr 30 '12 at 12:55
  • @Charles, thank you. Did not manage to find that question. – utapyngo Apr 30 '12 at 12:56
  • @Inerdial, I have just simplified things to make the example easier to read. What I actually have is a set of instances of a class which overrides `__hash__` and `__eq__` methods. – utapyngo Apr 30 '12 at 12:57
  • 3
    @utapyngo If your classes implement `__hash__` and `__eq__`, and two instances that are equal to each other aren't interchangeable, then I'd argue this is a design flaw. It's not a good idea to implement those methods in objects that aren't "value objects". – millimoose Apr 30 '12 at 13:00
  • What makes you think that `intersection` works in O(1) time? – senderle Apr 30 '12 at 13:06
  • @senderle: I don't not say that `intersection` works in O(1) time in common case. But something makes me think that intersecting with a set of 1 element is faster than O(n). Should be O(min(n1, n2)). This can explain why we get same results for `get_item_using_intersection1` and `get_item_using_intersection2`. – utapyngo Apr 30 '12 at 13:10
  • 1
    I looked up the C source of intersection. It checks which set is smaller (self or argument) and iterates over it checking each element with the other set using `in`. So yeah, it's O(min(n1, n2)). The result set is built from the smaller set so your code always returns `new_item` from `pop()` because your temp set has 1 element and `s` has 2. You should use `dict` as others suggest. – yak Apr 30 '12 at 13:19
  • 1
    Just a question. Why are you so worried with the effiency with this at this point? Is your application really slow and you profiled all of it and found the set operation to be your main bottleneck? – Pedro Werneck Apr 30 '12 at 13:35
  • @Inerdial: They could be interchangeable, and yet there are still cases where you would want one or the other. For example, if they are very large immutable objects, you would want to minimize the number of copies. – Dietrich Epp Apr 30 '12 at 14:02
  • @DietrichEpp The appropriate pattern for that would be a repository that can retrieve the canonical instance for a copy. (I.e. a `dict`.) And it's perfectly fine to do that, my only problem was with the lack of interchangeability which could lead to weird bugs. – millimoose Apr 30 '12 at 14:23
  • @Inerdial: I think my objects **are** value objects. But each of them stores a link to its container. If I use `new_item` instead of `item1` I will lose this link. – utapyngo May 01 '12 at 04:21

2 Answers2

12

Don't use a set, then. Just use a dict that maps some value to itself. In your case, it maps:

d[item1] = item1
d[item2] = item2

So anything that's equal to item1 will be found in d, but the value is item1 itself. And it's much better than linear time ;-)

P.S. I hope I understood the intention of your question correctly. If not, please clarify it.

Eli Bendersky
  • 263,248
  • 89
  • 350
  • 412
  • Thank you. I know it is possible to use `dict`s but I also know that technically it is possible to stay with `set`s (assuming there is an internal method which can find an item by hash). Besides, I don't want to rewrite my old code because I use set operations intensively. – utapyngo Apr 30 '12 at 12:49
  • 7
    @utapyngo: it's better to rewrite old code if it's incorrect. `set` is simply not designed for this - use a more appropriate data structure. – Eli Bendersky Apr 30 '12 at 13:00
  • How to do inersection, union and difference of such dicts in linear time? – utapyngo May 01 '12 at 03:24
  • @utapyngo: since you can iterate over one dict in O(n) and check for membership in another dict in O(1), why would that be a problem? The only real problem is that these operations will be quite a bit slower than on sets, because set union/interesection etc. is implemented on the C level in the Python interpreter – Eli Bendersky May 01 '12 at 03:45
  • @utapyngo: another possible direction - if you need such a solution to perform well, consider writing a C extension type that derives from `set` and keeps the required information, allowing to efficiently access it – Eli Bendersky May 01 '12 at 03:47
2

If you absolutely need the O(1) lookup and object identity (not just equality) and fast set operations (without having to create new sets each time you want to do set operations), then one fairly straightforward approach is to use both a dict and a set. You would have to maintain both structures to keep them in sync, but this would allow you to keep O(1) access (just with a bigger constant factor). (And maybe this is what you are heading toward with your "future solution which is very incomplete right now" in your edit.)

However, you haven't mentioned the volume of data you're working with, or what kind of performance problems you're having, if any. So I'm not convinced you really need to do this. It could be that dict with as-needed set creation, or set with linear lookup, is already fast enough.

John Y
  • 14,123
  • 2
  • 48
  • 72