I have a pandas DataFrame with a column of dictionaries. I also have a query dictionary, and I want to compute the minimum of the sums of the values at the keys common to both dictionaries.
For example:
dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
common keys = 'a', 'b'
s1 = dicta['a'] + dicta['b']
s2 = dictb['a'] + dictb['b']
result = min(s1, s2) = min(26, 2) = 2
I am using the following code to compute it:
def compute_common(dict1, dict2):
    common_keys = dict1.keys() & dict2.keys()
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))
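For reference, here is a self-contained check of the function against the worked example above (values taken straight from the question):

```python
def compute_common(dict1, dict2):
    # Sum the values at the keys present in both dicts; return the smaller sum.
    common_keys = dict1.keys() & dict2.keys()
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}

print(compute_common(dicta, dictb))  # → 2  (common keys 'a', 'b'; min(26, 2))
```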
Following are the timings for the operations on my 8-core i7 machine with 8 GB of RAM.
%timeit df['a'].apply(lambda x:compute_common(dictb, x))
55.2 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I also found out that I can use swifter, which tries to improve the performance of pandas apply by using multiprocessing internally:
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
66.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using swifter is even slower (maybe because of the multiprocessing overhead). I wanted to know if there is any way to squeeze more performance out of this operation.
You can use the following to replicate things:
import pandas as pd

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})
%timeit df['a'].apply(lambda x:compute_common(dictb, x))
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
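One variant that might be worth timing alongside the two above (a sketch, not benchmarked here) is a plain list comprehension over the column's underlying objects, since apply adds per-row overhead on object-dtype columns. To keep the sketch standalone, `rows` below stands in for `df['a'].tolist()`:

```python
def compute_common(dict1, dict2):
    common_keys = dict1.keys() & dict2.keys()
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}

# Stand-in for df['a'].tolist(): 30000 references to the same dict.
rows = [dicta] * 30000

# Same computation as df['a'].apply(...), minus the pandas machinery.
results = [compute_common(dictb, x) for x in rows]
```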
Thanks in advance.