I have a pandas DataFrame with a column that contains dictionaries. I also have a query dictionary, and I want to compute the minimum of the sums of the values at the keys common to both dictionaries.
For example:

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
common keys = 'a', 'b'
s1 = dicta['a'] + dicta['b'] = 5 + 21 = 26
s2 = dictb['a'] + dictb['b'] = 1 + 1 = 2
result = min(s1, s2) = 2

I am using the following code to compute it.

def compute_common(dict1, dict2):
    common_keys = dict1.keys() & dict2.keys()
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))

Following are the timings for this operation on my 8-core i7 machine with 8 GB of RAM.

%timeit df['a'].apply(lambda x:compute_common(dictb, x))
55.2 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I also found out that I can use swifter to improve the performance of pandas apply (it uses multiprocessing internally).

%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
66.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using swifter is even slower (maybe because of the multiprocessing overhead). I wanted to know if there is any way to squeeze more performance out of this operation.

You can use the following to replicate things.

import pandas as pd
import swifter

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

%timeit df['a'].apply(lambda x:compute_common(dictb, x))
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))

Thanks in advance.

3 Answers

Use a list comprehension to collect the common keys, then sum the values at those keys in each dictionary and take the minimum of the two sums. Here common_keys becomes ['a', 'b']; summing the values for a and b gives 26 and 2, and the minimum of 26 and 2 is 2.

def find_common_keys(dicta, dictb):
    '''
    >>> find_common_keys({'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67},
    ...                  {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67})
    2
    '''
    common_keys = [key for key in dicta if key in dictb]

    s1 = sum(dicta[key] for key in common_keys)
    s2 = sum(dictb[key] for key in common_keys)
    return min(s1, s2)

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}

print(find_common_keys(dicta, dictb))

output

2
– Golden Lion

  • Thanks @Golden Lion, but I don't understand how it improves the performance when applied to a pandas DataFrame. – satinder singh Nov 02 '21 at 14:06
  • Run a timeit and compare your code with this code. Lambda functions are not necessarily faster; they fit the zen pattern of coding, but performance must always be measured. – Golden Lion Nov 02 '21 at 14:13
  • You are correct. Your function reduces the runtime by 10 milliseconds. Upvoting. – satinder singh Nov 02 '21 at 14:19
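
For reference, a quick way to run that comparison (a sketch, assuming df, dictb, compute_common, and find_common_keys are all defined as above):

# IPython: time both implementations on the same column
%timeit df['a'].apply(lambda x: compute_common(dictb, x))
%timeit df['a'].apply(lambda x: find_common_keys(dictb, x))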

You can explode the dictionaries into a DataFrame and sum the columns:

dict_data = pd.DataFrame(df['a'].tolist())

common_keys = dict_data.columns.intersection(dictb.keys())

dictb_sum = sum(dictb[k] for k in common_keys)
dicta_sum = dict_data[common_keys].sum(1)

# row-wise minimum against the scalar dictb_sum
output = dicta_sum.clip(upper=dictb_sum)

This is twice as fast as apply on my system. Note that this works well when the union of keys across df['a'] is not too big (since those keys all become columns of dict_data), but the frame has enough rows that the vectorized .sum(1) pays off.
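
As a quick sanity check (a sketch, assuming df, dictb, and compute_common from the question are already defined), the vectorized result can be compared against the original apply-based result:

# the vectorized output should match the apply-based result element-wise
expected = df['a'].apply(lambda x: compute_common(dictb, x))
assert (output == expected).all()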

– Quang Hoang

Following are some of my findings; I'm sharing them in case they help someone else. These are the optimizations I was able to achieve by extending @Golden Lion's idea.

  1. Just compiling the function with Cython gives a 10% performance boost.
  2. Since Python is dynamically typed, writing the Cython function with explicit type declarations increases performance further.
  3. Since function calls in Python are expensive, replacing min(x1, x2) with x1 if x1 < x2 else x2 gives an additional benefit.

The final function I used gave me a 3x performance boost.

cpdef int cython_common(dict_1, dict_2):
    cdef dict dict1 = dict_1
    cdef dict dict2 = dict_2
    cdef list common_keys = [key for key in dict1 if key in dict2]
    cdef int sum1 = 0
    cdef int sum2 = 0
    for i in common_keys:
        sum1 += dict1[i]
        sum2 += dict2[i]
    return sum1 if sum1 < sum2 else sum2
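
For reference, one way to compile such a function is with a small setup.py; this is a minimal sketch, and the file name compute_common.pyx is only an assumption:

# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

# compute_common.pyx is assumed to contain the cpdef function above
setup(ext_modules=cythonize("compute_common.pyx"))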

Also, with some experiments I found that libraries like pandarallel and swifter give a speedup when the dataset has a large number of rows (for fewer rows, I think the overhead of spawning processes and combining the results is larger than the compute itself).
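
For completeness, here is a minimal pandarallel sketch (assuming the same df, dictb, and compute_common as in the question):

from pandarallel import pandarallel

# start the worker processes once per session
pandarallel.initialize(progress_bar=False)

# parallel_apply mirrors Series.apply but splits the work across workers
result = df['a'].parallel_apply(lambda x: compute_common(dictb, x))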

Also this is a great read.