7

In general, Python sets don't seem to be designed for retrieving items by key. That's obviously what dictionaries are for. But is there anyway that, given a key, you can retrieve an instance from a set which is equal to the key?

Again, I know this is exactly what dictionaries are for, but as far as I can see, there are legitimate reasons to want to do this with a set. Suppose you have a class defined something like:

class Person:
   def __init__(self, firstname, lastname, age):
      self.firstname = firstname
      self.lastname = lastname
      self.age = age

Now, suppose I am going to be creating a large number of Person objects, and each time I create a Person object I need to make sure it is not a duplicate of a previous Person object. A Person is considered a duplicate of another Person if they have the same firstname, regardless of other instance variables. So naturally the obvious thing to do is insert all Person objects into a set, and define a __hash__ and __eq__ method so that Person objects are compared by their firstname.

An alternate option would be to create a dictionary of Person objects, and use a separately created firstname string as the key. The drawback here is that I'd be duplicating the firstname string. This isn't really a problem in most cases, but what if I have 10,000,000 Person objects? The redundant string storage could really start adding up in terms of memory usage.

But if two Person objects compare equally, I need to be able to retrieve the original object so that the additional instance variables (aside from firstname) can be merged in a way required by the business logic. Which brings me back to my problem: I need some way to retrieve instances from a set.

Is there anyway to do this? Or is using a dictionary the only real option here?

Channel72
  • 24,139
  • 32
  • 108
  • 180
  • "Python sets don't seem to be designed for retrieving items by key". That's a matter of definition. Sets don't have keys. Each item in the set is it's own key. A "set with a key" is a dictionary, by definition. I'm not sure this question makes sense considering that the definitions seem very clear. – S.Lott May 12 '11 at 15:45

3 Answers3

8

I'd definitely use a dictionary here. Reusing the firstname instance variable as a dictionary key won't copy it -- the dictionary will simply use the same object. I doubt a dictionary will use significantly more memory than a set.

To actually save memory, add a __slots__ attribute to your classes. This will prevent each of you 10,000,000 instances from having a __dict__ attribute, which will save much more memory than the potential overhead of a dict over a set.

Edit: Some numbers to back my claims. I defined a stupid example class storing pairs of random strings:

def rand_str():
    return str.join("", (chr(random.randrange(97, 123))
                         for i in range(random.randrange(3, 16))))

class A(object):
    def __init__(self):
        self.x = rand_str()
        self.y = rand_str()
    def __hash__(self):
        return hash(self.x)
    def __eq__(self, other):
        return self.x == other.x

The amount of memory used by a set of 1,000,000 instances of this class

random.seed(42)
s = set(A() for i in xrange(1000000))

is on my machine 240 MB. If I add

    __slots__ = ("x", "y")

to the class, this goes down to 112 MB. If I store the same data in a dictionary

def key_value():
    a = A()
    return a.x, a

random.seed(42)
d = dict(key_value() for i in xrange(1000000))

this uses 249 MB without __slots__ and 121 MB with __slots__.

Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
  • From what I understand, internally to CPython sets *are* dictionaries, they just have null as the stored value. – kindall May 12 '11 at 17:28
  • 2
    @kindall: Maybe at some point sets were dictionaries, but in current versions of Python sets have thier own [C data type](http://svn.python.org/view/python/tags/r271/Include/setobject.h?view=markup) and their own [C implementation](http://svn.python.org/view/python/tags/r271/Objects/setobject.c?view=markup). – Sven Marnach May 12 '11 at 18:59
3

Yes, you can do this: A set can be iterated over. But note that this is an O(n) operation as opposed to the O(1) operation of the dict.

So, you have to trade off speed versus memory. This is a classic. I personally would optimize for here (i.e. use the dictionary), since memory won't get short so quickly with only 10,000,000 objects and using dictionaries is really easy.

As for additional memory consumption for the firstname string: Since strings are immutable in Python, assigning the firstname attribute as a key will not create a new string, but just copy the reference.

Daren Thomas
  • 67,947
  • 40
  • 154
  • 200
  • I see. The annoying thing is that, although speed versus memory is a classic tradeoff, in this particular case there's no inherent *need* for the tradeoff. A set could just as easily have O(1) retrieval, but it simply doesn't due to a language limitation. – Channel72 May 12 '11 at 14:56
  • why should it have O(1) retrieval? how would you implement that in a general case? You can probably view a set as a degenerate dictionary, with the value used as the key itself. But retrieving something you already have is rather pointless, non? – Daren Thomas May 12 '11 at 15:00
  • You wrote "Since strings are immutable in Python, assigning the firstname attribute as a key will not create a new string, but just copy the reference." This is not a feature of immutability in Python, rather it's a feature of how Python handles names and assignment. You can have a mutable object and use it as a key and Python will not create a new instance of the object. – Steven Rumbalski May 12 '11 at 15:19
  • @Steven Rumbalski: Right. I was being rather imprecise (i.e. *wrong*) there. When assigning strings, though, it can be nice to know if you are getting a copy or, if you aren't, that your string won't be mutated from somewhere else in the code. Thank you for pointing that out! – Daren Thomas May 13 '11 at 07:26
2

I think you'll have the answer here:

Moving Beyond Factories in Python

Community
  • 1
  • 1
eyquem
  • 26,771
  • 7
  • 38
  • 46