
I am modeling data for an application and decided to choose dictionary as my data structure. But each row in the data has multiple keys. So I created a dictionary with multiple keys mapping each row, something like:

>>> multiKeyDict = {}
>>> multiKeyDict[('key1','key2','key3')] = 'value1'
>>> multiKeyDict.get(('key1','key2','key3'))
'value1'

Now I have to retrieve all the values associated with key1 in O(1) time. From my research I know I could use something like the multi_key_dict package.

I am also open for any better data structures instead of using the dictionary.

PseudoAj

2 Answers

1

You don't have multiple keys. As far as the Python dictionary is concerned, there is just one key, a tuple object. You can't search for the constituents of the tuple in anything other than O(N) linear time.

If your keys are unique, just add each key individually:

multiKeyDict['key1'] = multiKeyDict['key2'] = multiKeyDict['key3'] = 'value1'

Now you have 3 keys all referencing one value. The value object is not duplicated here, only the references to it are.
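A minimal sketch of that approach, using the key and value names from the question:

```python
# Map each constituent key directly to the shared value.
multiKeyDict = {}
value = 'value1'
multiKeyDict['key1'] = multiKeyDict['key2'] = multiKeyDict['key3'] = value

# Each lookup is an ordinary O(1) dict access.
print(multiKeyDict['key2'])  # value1

# All three keys reference the same object; nothing is copied.
print(multiKeyDict['key1'] is multiKeyDict['key3'])  # True
```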

The multi_key_dict package you found uses an intermediate mapping to map a given constituent key to the composite key, which then maps to the value. This gives you O(1) search too, with the same limitation that each constituent key must be unique.
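A rough sketch of that intermediate-mapping technique — this illustrates the idea only, and is not the actual multi_key_dict API:

```python
# key_to_composite maps each constituent key to the composite key;
# composite_to_value maps the composite key to the value.
key_to_composite = {}
composite_to_value = {}

def insert(keys, value):
    composite = tuple(keys)
    composite_to_value[composite] = value
    for key in keys:
        key_to_composite[key] = composite

def lookup(key):
    # Two O(1) dict lookups: constituent key -> composite key -> value.
    return composite_to_value[key_to_composite[key]]

insert(('key1', 'key2', 'key3'), 'value1')
print(lookup('key2'))  # value1
```

As with the direct approach, each constituent key must be unique, or a later insert silently redirects it to a different composite key.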

If your keys are not unique then you need to map each key to another container that then holds the values, like a set for instance:

for key in ('key1', 'key2', 'key3'):
    multiKeyDict.setdefault(key, set()).add(value)

Now looking up a key gives you the set of all values that that key references.
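For example, with two rows sharing `key1` (the second row's keys and value are made up for illustration):

```python
multiKeyDict = {}

# Two rows that share 'key1'.
rows = [(('key1', 'key2', 'key3'), 'value1'),
        (('key1', 'key4', 'key5'), 'value2')]

for keys, value in rows:
    for key in keys:
        # Each key maps to the set of all values it belongs to.
        multiKeyDict.setdefault(key, set()).add(value)

print(sorted(multiKeyDict['key1']))  # ['value1', 'value2']
print(multiKeyDict['key4'])          # {'value2'}
```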

If you need to be able to combine keys too, then you can add additional references with those combinations. Key-value pairings are relatively cheap, it's all just references. The key and value objects themselves are not duplicated.
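One way to sketch that: the combination is just another dict entry. Storing it as a frozenset (so lookup order doesn't matter) is an illustrative choice here, not something the answer prescribes:

```python
multiKeyDict = {}
value = 'value1'

# Individual keys...
for key in ('key1', 'key2', 'key3'):
    multiKeyDict.setdefault(key, set()).add(value)

# ...plus a combined key; frozenset is hashable and order-insensitive.
multiKeyDict.setdefault(frozenset(('key1', 'key2')), set()).add(value)

# The combination can now be looked up in either order.
print(multiKeyDict[frozenset(('key2', 'key1'))])  # {'value1'}
```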

Martijn Pieters
  • `key1` might have multiple values and I don't want to map values to each keys as it won't scale with data – PseudoAj Mar 09 '17 at 14:53
  • @PseudoAj: then you don't have data that is suitable for a hash table, and are stuck with a linear search for this data structure. The same applies to the `multi_key_dict` package you found. – Martijn Pieters Mar 09 '17 at 14:55
  • @PseudoAj what do you mean by "won't scale with data" – Jean-François Fabre Mar 09 '17 at 14:57
  • @PseudoAj: so what did you expect to find when searching for `key1` if there are multiple values? You can't have a O(1) search if there are K values per key. – Martijn Pieters Mar 09 '17 at 14:57
  • @Jean-FrançoisFabre for each row you are creating three entries in the data. So if you have million records then you will be having 3 million records ... also think about if I want to add additional key.. – PseudoAj Mar 09 '17 at 14:58
  • @PseudoAj at least it doesn't copy the values. Only dict entries. – Jean-François Fabre Mar 09 '17 at 15:01
  • 3
    @PseudoAj: if you have millions of records with complex relationships, then it sounds like you want a relational database instead. Let *it* worry about efficient data structures. – Martijn Pieters Mar 09 '17 at 15:01
  • @MartijnPieters yes, I am now seriously thinking of using database or key value store like Redis for this... – PseudoAj Mar 09 '17 at 15:04
0

Another possibility is to build up an index from each key component to the list of row-objects that contain it. Provided the number of rows sharing any particular key value is small, this will be quite efficient. (Assume the key components of a row are accessed as row.key1, row.key2, etc.; the exact access mechanism is not an important detail.) Untested code:

index = {}
for row in rows:
    index.setdefault( row.key1, []).append(row)
    index.setdefault( row.key2, []).append(row)
    index.setdefault( row.key3, []).append(row)

and then, to look up the rows that match, say, key2 and key3:

candidates = index[key2]
if len(index[key3]) < len(candidates):
    candidates = index[key3]  # use key3 if it offers a better distribution
results = []
for cand in candidates:
    if cand.key2 == key2 and cand.key3 == key3:  # full test is necessary!
        results.append(cand)
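A runnable version of the same idea, using a namedtuple for the rows (the field names and sample values are placeholders):

```python
from collections import namedtuple

Row = namedtuple('Row', ['key1', 'key2', 'key3'])
rows = [Row('a', 'b', 'c'), Row('a', 'x', 'c'), Row('d', 'b', 'c')]

# Build the index: each key component maps to the rows containing it.
index = {}
for row in rows:
    for key in (row.key1, row.key2, row.key3):
        index.setdefault(key, []).append(row)

# Look up rows matching key2 == 'b' and key3 == 'c': scan the shorter
# candidate list, then verify the full match on each candidate.
key2, key3 = 'b', 'c'
candidates = index[key2]
if len(index[key3]) < len(candidates):
    candidates = index[key3]
results = [cand for cand in candidates
           if cand.key2 == key2 and cand.key3 == key3]

print(results)
# [Row(key1='a', key2='b', key3='c'), Row(key1='d', key2='b', key3='c')]
```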
nigel222