3

In order to quickly compare the keys of 2 dictionaries, I'm creating sets of the keys using this method:

dict_1 = {"file_1":10, "file_2":20, "file_3":30, "file_4":40}
dict_2 = {"file_1":10, "file_2":20, "file_3":30}
set_1 = {file for file in dict_1}
set_2 = {file for file in dict_2}

Than I use diff_set = set_1 - set_2 to see which keys are missing from set_2.

Is there a faster way? I see that using set(dict.keys()) is less of a workarou, so I'll switch to it - but is it more efficient?

Asi
  • 31
  • 4

2 Answers2

3

Let's measure more properly (not just measuring a single execution and also not including the setup) and include faster solutions:

300 ns  300 ns  300 ns  {*dict_1} - {*dict_2}
388 ns  389 ns  389 ns  {file for file in dict_1 if file not in dict_2}
389 ns  390 ns  390 ns  dict_1.keys() - dict_2
458 ns  458 ns  458 ns  set(dict_1) - set(dict_2)
472 ns  472 ns  472 ns  dict_1.keys() - dict_2.keys()
665 ns  665 ns  668 ns  set(dict_1.keys()) - set(dict_2.keys())
716 ns  716 ns  716 ns  {file for file in dict_1} - {file for file in dict_2}

Benchmark code (Try it online!):

import timeit

setup = '''
dict_1 = {"file_1":10, "file_2":20, "file_3":30, "file_4":40}
dict_2 = {"file_1":10, "file_2":20, "file_3":30}
'''

codes = [
    '{file for file in dict_1} - {file for file in dict_2}',
    'set(dict_1) - set(dict_2)',
    'set(dict_1.keys()) - set(dict_2.keys())',
    'dict_1.keys() - dict_2',
    'dict_1.keys() - dict_2.keys()',
    '{*dict_1} - {*dict_2}',
    '{file for file in dict_1 if file not in dict_2}',
]

exec(setup)
for code in codes:
    print(eval(code))

tss = [[] for _ in codes]
for _ in range(20):
    print()
    for code, ts in zip(codes, tss):
        number = 10000
        t = min(timeit.repeat(code, setup, number=number)) / number
        ts.append(t)
    for code, ts in sorted(zip(codes, tss), key=lambda cs: sorted(cs[1])):
        print(*('%3d ns ' % (t * 1e9) for t in sorted(ts)[:3]), code)
Kelly Bundy
  • 23,480
  • 7
  • 29
  • 65
2

The fastest and most efficient way would be:

diff_set = {*dict_1} - {*dict_2}

Output:

{'file_4'}

Proof(Execution time comparison):

import timeit
    
dict_1 = {"file_1":10, "file_2":20, "file_3":30, "file_4":40}
dict_2 = {"file_1":10, "file_2":20, "file_3":30}

def method1():
    return {file for file in dict_1} - {file for file in dict_2}

def method2():
    return set(dict_1) - set(dict_2)

def method3():
    return set(dict_1.keys()) - set(dict_2.keys())

def method4():
    return dict_1.keys() - dict_2.keys()

def method5():
    return {*dict_1} - {*dict_2}


print(method1())
print(method2())
print(method3())
print(method4())
print(method5())

print(timeit.timeit(stmt = method1, number = 10000)/10000)
print(timeit.timeit(stmt = method2, number = 10000)/10000)
print(timeit.timeit(stmt = method3, number = 10000)/10000)
print(timeit.timeit(stmt = method4, number = 10000)/10000)
print(timeit.timeit(stmt = method5, number = 10000)/10000)

Output:

It took 1.6434900000149355e-06 sec for method 1
It took 8.317999999690073e-07 sec for method 2
It took 1.1994899999990594e-06 sec for method 3
It took 9.747700000389159e-07 sec for method 4
It took 8.049199999732082e-07 sec for method 5
Abhyuday Vaish
  • 2,357
  • 5
  • 11
  • 27
  • 1
    a microsecond might be due to a system process doing something. I would draw a conclusion that they are equivalent without more testing. Just my $.02. Awesome answer otherwise. – netskink Apr 20 '22 at 15:22
  • 1
    Well, as mentioned in mine: You'd better not include the setup in the time and you'd better not just measure just a single execution (I do 10000, repeat that five times, then repeat all that 20 times). With the three imports it also looks like you didn't run this as a single script but ran three scripts separately, in which case different salts of string hashing might influence the results (granted, that's a flaw in mine as well, but I did run it multiple times and the results were similar). Not sure what to explain about my code, is something unclear? – Kelly Bundy Apr 20 '22 at 18:10
  • @KellyBundy Thanks for the suggestions/advice. I have edited my answer. Kindly take a look at it and let me know if it is correct or not? – Abhyuday Vaish Apr 21 '22 at 05:10
  • Yes, much better. Still, some suggestions: 1) Nicer formatting of the results: not in scientific notation which forces us to check the exponent, and the many digits are rather distracting. 2) Run multiple rounds, in order to prevent stuff like the process not getting as much CPU share at the start as it does at the end when it's "warmed up", or randomly getting less CPU share during one of the methods. Also see the [note under `repeat`](https://docs.python.org/3/library/timeit.html#timeit.Timer.repeat). 3) I wouldn't declare it as "The fastest". Maybe there's still something even faster. – Kelly Bundy Apr 21 '22 at 05:51