Logically merging physically distinct dictionaries

Question

I am trying to write a wrapper around two dictionaries, so that they seem like one dictionary (for reading only; writing should raise Exceptions).

I am doing this to save memory, since one of the original dictionaries is needed elsewhere. I also think it's faster than merging the dictionaries, if less than half the elements in the combined dictionaries are going to be looked up.

Here's my attempt:

class LogicalMerge:
  def __init__(self, d1, d2):
    #d1 and d2 are dictionaries
    self.d1 = d1
    self.d2 = d2
  def __getitem__(self, x):
    if x in self.d1:
      return self.d1[x]
    else:
      return self.d2[x]

d1 = {1:2, 3:4}
d2 = {5:10}
d = LogicalMerge(d1, d2)
d[1] # == 2
d[5] # == 10

Are there any design, technical, or performance problems with this approach?

There are going to be asymmetry between the two dict? Like one longer then the other or more used? Or like that the last used is probably going to be the one to be used again? — Rik Poggi, Jan 21 '12 at 21:02
The one that will be reused elsewhere is `d2`. `d1` isn't used anywhere except in `LogicalMerge.__init__` (and only once too). Other than that, I can't say which dictionary is larger, in which order they'll be accessed, etc. — max, Jan 21 '12 at 21:18

Fred Foo · Answer 1 · 2012-01-25T11:19:51.897

4

You can save yourself one lookup per call by rewriting __getitem__ as

try:
    return self.d1[x]
except KeyError:
    return self.d2[x]

This follows the pythonic idiom of "asking forgiveness, not permission".

I also think it's faster than merging the dictionaries

I strongly doubt that, but you should measure to be sure. Your approach introduces an extra level of indirection and requires the key to be hashed multiple times. It's also bound to take more memory.

Edit: here's an alternative approach. Replace one of your dicts with a DictWithBackup, which behaves like a dict except that when a key is missing, it looks into the other dict.

class DictWithBackup(dict):
    def __init__(self, backup):
         self._backup = backup

    def __missing__(self, key):
         return self._backup[key]

This versions avoids the exception handling.

edited Jan 25 '12 at 11:19

answered Jan 21 '12 at 20:47

Fred Foo

355,277
75
744
836

It should not take more memory, and the try/except version will probably be slower: setting up a try/except block is not "free" nd I'd bet it takes more resources than an extra lookup in one of the dictionaries. – jsbueno Jan 21 '12 at 22:19
1

I strongly support "asking forgiveness, not permission" but if [this question's answers](http://stackoverflow.com/questions/2522005/cost-of-exception-handlers-in-python) are accurate I assume the try/except block will be quite slower than an if block in this case. Every item that's in d2 will not be in d1 (according to one of max's comments) so we'll get an exception for every item in d2. If d2's size is comparable to d1's there will be quite a lot of exceptions. I suppose the idiom applies to cases where exceptions are rare.. – nitsas Jan 22 '12 at 01:29
@chrisn654: [Python - Is a dictionary slow to find frequency of each character?](http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character/2525617#2525617) indicates that `.setdefault()` (lookup) can be slower than `try/except`. Performance questions are best to be answered using real input data and a profiler. – jfs Jan 22 '12 at 09:51
@larsmans: "hashed multiple times" - that's a good point. However, the memory costs of merging the dictionaries is too high, so it's not an option. I'd have to eat whatever time costs I have to. – max Jan 25 '12 at 00:49

moooeeeep · Answer 2 · 2012-01-25T11:34:10.240

3

For performance reasons I would prefer the following. Given None is an object that never validly occurs.

def __getitem__(self, k):
  v = self.d1.get(k, None)
  if v is None:
    v = self.d2[k] # if you're going to raise an error anyway ...
  return v

Otherwise you could default-get a custom object. Note, that you either need an object that implements __eq__ to test value equality (o1 == o2), or---performance-wise even better---that you use an immutable object, i.e. a certain string "error_key_not_found_string", that is not newly created every time. Then you may even compare by object identity id(o1) == id(o2), i.e. using the is operator. (You don't need to provide __eq__ then either.)

def __getitem__(self, k):
  v = self.d1.get(k, "error_key_not_found_string")
  # if id(v) == id("error_key_not_found_string":
  if v is "error_key_not_found_string": 
    v = self.d2[k] # if you're going to raise an error anyway ...
  return v

Have you thought about the case, where the item is in both dictionaries?

In conclusion, I find this a bit confusing from the design perspective. Is the performance gain really justifying the additional source of error and confusion? Plus you will lose all the other dict functionality... This could be as easy as d1.update(d2). Given d1 is the dictionary you don't use elsewhere (you could use a deepcopy then).

edited Jan 25 '12 at 11:34

answered Jan 21 '12 at 21:15

moooeeeep

31,622
22
98
187

In case both dictionaries contain the same key, I would have known that earlier in the program, and raised an exception. As to the justification for doing this, I am choosing to do logical rather than physical merge to save memory rather than run-time. Both dictionaries are large, and one of them (`d2`) must be left untouched. – max Jan 21 '12 at 21:21
it boils down to the question _is performance really a matter?_ that of course only you can answer. If not, I would prefer the clearer code. – moooeeeep Jan 21 '12 at 21:28
If performance is so critical he could just use (please bear with me): `v = self.d1.get(k, self.d2.get(k, None))` and then a single `if v is None: raise KeyError(k)`. – nitsas Jan 22 '12 at 01:38
Performance would be nice, if the code is only slightly more complex. Memory usage, however, is absolutely critical. – max Jan 22 '12 at 02:25
@chrisn654 : I was also thinking about nesting this into one call of `get`, but I was not sure (and not in the mood of trying it out) about the order of evaluation. Not sure if you'd get the shortcut behavior then... – moooeeeep Jan 22 '12 at 09:39
1

@moooeeeep: you definitely don't get a shortcut behaviour with nested `.get()`. – jfs Jan 22 '12 at 09:55
@J.F.Sebastian: I suspected that. – moooeeeep Jan 23 '12 at 13:20
@moooeeeep d1.update(d2) would require copying enormous amounts of data, many times over (the dictionaries contain records, very large ones; and there are tens of thousands of them). I can't use `None`; it's too dangerous (even if it's allowed in the records now, it might be one day). `KeyError()` would be fine; I'll see if it's faster. – max Jan 25 '12 at 00:46
@max: make sure to use an object that implements `__eq__` (not sure if `KeyError` does. It will be a performance difference if you use an immutable object or not, i.e. a certain string, that is not newly created every time... – moooeeeep Jan 25 '12 at 10:02

Logically merging physically distinct dictionaries

2 Answers2