1

I'm looking for a pythonic way to map a given value to a tuple of keys in a data structure, so that datastructure[key1, key2] returns the same value as datastructure[key2, key1], without having to save the value twice for each mutation of the key tuple. Ideally, I'm looking for something like a dictionary with sets or "order-insensitive tuples" (I know, that tuples are by definition ordered) as keys.

My usecase for this is: I need to save similarity values of pairs of strings in an efficient way. Since sim(word1, word2) equals sim(word2, word1) (so ordering doesn't matter), I don't want to save the values for each tuple, because it would be redundant. Instead, I'd like to save one value for each word pair (the tuple key) and access it by either ordering of the words that the pair is made up.

Currently, I'm doing it like this and it's working fine:

d = {tuple(sorted(['w1', 'w2'])): 4}

print(d[tuple(sorted(['w1', 'w2']))])
print(d[tuple(sorted(['w2', 'w1']))])

>> 4
>> 4

But each update and access to the dictionary now requires a list initiation, sort action, and tuple transformation.

My question is therefore: Is there an elegant way to do, what I want, without having to use the above workaround or defining a custom class as key? Maybe there is a datastructure that allows exactly that and may even be faster because it is implemented on a lower level than my workaround. Ideally this would look something like this:

d = {('w1', 'w2'): 4}

print(d[('w1','w2')])
print(d[('w2','w1')])

>> 4
>> 4

Note: The above example doesn't work, because ('a2','b2') != ('b2','a2').

Here is the research I've done so far and an explanation of why these solutions aren't exactly what I'm looking for:

I'm thankful for any hints and make excuses in advance, if a duplicate Stackoverflow question exists and I overlooked it - it's hard to search for this, since I don't know how the desired feature would be called.

AlinaOs
  • 13
  • 2
  • @FrankYellin Thank you, I did'nt know about frozensets. It works and seems to be exactly what I was looking for in terms of datastructure properties! But for this special case I'll go with the alternative proposed by Kelly Bundy since it is faster. – AlinaOs Jul 02 '23 at 14:28

2 Answers2

0

You could use a helper function that creates sorted tuples in a cheaper way:

def key(s, t):
    return (s, t) if s < t else (t, s)

d = {key('w1', 'w2'): 4}

print(d[key('w1','w2')])
print(d[key('w2','w1')])

The function call overhead makes it slower than inlining the expression would be, but it's still faster than the alternatives mentioned so far.

Here are various expressions and their times for evaluating them and using them to look up the dict value:

 132.7 ± 1.8 ns  (s, t) if s < t else (t, s)
 171.7 ± 1.5 ns  key(s, t)
 236.5 ± 0.9 ns  frozenset({s, t})
 254.9 ± 2.5 ns  frozenset((s, t))
 273.3 ± 2.0 ns  frozenset([s, t])
 343.2 ± 3.6 ns  tuple(sorted([s, t]))
 767.0 ± 9.7 ns  T(s, t)

Benchmark code (Attempt This Online!):

from timeit import timeit
from statistics import mean, stdev

setup = '''
s, t = 'w1', 'w2'
sts = [(s, t), (t, s)] * 5000

def key(s, t):
    return (s, t) if s < t else (t, s)

# from Anentropic's answer
class T(tuple):
    def __new__(cls, *args):
        return super().__new__(cls, tuple(sorted(args)))

d = {KEY: 4}
'''

funcs = [
    "tuple(sorted([s, t]))",
    "frozenset([s, t])",
    "frozenset((s, t))",
    "frozenset({s, t})",
    "key(s, t)",
    "(s, t) if s < t else (t, s)",
    "T(s, t)",
]

times = {f: [] for f in funcs}
def stats(f):
    ts = [t * 1e9 for t in sorted(times[f])[:10]]
    return f'{mean(ts):6.1f} ± {stdev(ts):3.1f} ns '

for _ in range(100):
    for f in funcs:
        code = f'for s, t in sts: d[{f}]'
        setup_ = setup.replace('KEY', f)
        t = timeit(code, setup_, number=10**0) / 1e4
        times[f].append(t)

for f in sorted(funcs, key=stats):
    print(stats(f), f)
Kelly Bundy
  • 23,480
  • 7
  • 29
  • 65
0

Here's a super simple implementation of order-insensitive tuple:

class T(tuple):
    def __new__(cls, *args):
        return super().__new__(cls, tuple(sorted(args)))

Noting that:

  • it loses any information about the original order of members, it 'canonicalises' the tuple into sorted order.
  • T is a bad name for a class, but it makes the usage very succinct
In [32]: t1 = T('a', 'b')

In [33]: hash(t1)
Out[33]: 9116817675213165770

In [34]: t2 = T('b', 'a')

In [35]: hash(t2)
Out[35]: 9116817675213165770

In [36]: t3 = T('a', 'c')

In [37]: t4 = T('a', 'b', 'c')

In [38]: hash(t3)
Out[38]: -385757029225169101

In [39]: hash(t4)
Out[39]: 6815493159389633286

One advantage of this over frozenset is that it is still a tuple, so it can distinguish between say ('a', 'a', 'b') and ('a', 'b'), whereas the set will ignore duplicates.

But since you only need pairs of values and not arbitrary or varying length tuples you are probably best off using the key function from Kelly Bundy's excellent answer.

Anentropic
  • 32,188
  • 12
  • 99
  • 147
  • Makes it nicer but also adds large overhead, about 400 ns in my updated benchmark. With `def T(*args): return tuple(sorted(args))` the overhead would just be about 40 ns. – Kelly Bundy Jul 02 '23 at 14:14
  • @KellyBundy ...interesting! I wonder why it's so much slower - naively it seems to be doing much the same thing. Presumably the nature of sub-classing introduces a bunch of extra overhead vs bare builtins `tuple`. And making use of `*args` adds another transient tuple instantiation. – Anentropic Jul 02 '23 at 14:35