4

I basically have a dataframe (df1) with 7 columns. The values are always integers. I have another dataframe (df2), which has 3 columns. One of these columns holds a list of lists, each being a sequence of 7 integers. Example:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
         data = np.random.randint(1,5,(100,7)))

df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
         data = [['Alfred','Chicago', 
                  np.random.randint(1,5,(100,7))],
                ['Nicola','New York', 
                  np.random.randint(1,5,(100,7))]])

I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:

df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in df2.at[i, 'Sequence'].tolist() for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c/len(df1)

Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?

alex22200

2 Answers

3

By default the engine used in pandas is Cython, but you can also change the engine to Numba or use the njit decorator to speed things up. Look up the pandas "enhancing performance" docs (enhancingperf).

Numba converts Python code to optimized machine code; pandas is highly integrated with numpy, and hence with Numba as well. You can experiment with the parallel, nogil, cache and fastmath options for extra speedup. This method shines for huge inputs where speed is needed.

With Numba you can either do eager compilation (give the signature up front, so the function is compiled at decoration time), or let the first execution take a little extra time for compilation so that subsequent calls are fast.
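For example, a minimal sketch of eager compilation (the function name and the int64 signature are assumptions; np.random.randint typically yields int64 on Linux but int32 on Windows):

import numba as nb

# Eager compilation: the signature is supplied up front, so the function is
# compiled when the decorator runs instead of on the first call.
@nb.njit(nb.float64(nb.int64[:, :], nb.int64[:, :]))
def match_fraction(x, a):
    # fraction of positions where two equally shaped integer arrays agree
    return (x == a).sum() / x.size

Calling it with arrays of a different dtype (e.g. int32) would then fail to match the declared signature, which is the trade-off of eager compilation.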

import pandas as pd
import numpy as np
import numba as nb

df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
         data = np.random.randint(1,5,(100,7)))

df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
         data = [['Alfred','Chicago', 
                  np.random.randint(1,5,(100,7))],
                ['Nicola','New York', 
                  np.random.randint(1,5,(100,7))]])

a = df1.values
# Also possible to add `parallel=True`
# fraction of matching positions: sum of the boolean comparison over the total size
f = nb.njit(lambda x: (x == a).sum() / a.size)

# This is just an illustration, not correct logic. Change the logic according to your needs
# @nb.njit(parallel=True)
# def f(x):
#     total = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             total += (x[i] == a[j]).sum()
#     return total


# Experiment with engine
print(df2['Sequence'].apply(f))
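To make the commented-out loop above concrete, here is a hedged, runnable sketch (the function name, the parallel flag and the normalisation by the number of compared positions are assumptions, not part of the original answer); it also works when df1 and the sequences have different numbers of rows:

import numba as nb

@nb.njit(parallel=True)
def overlap_vs_rows(x, a):
    # x: one 'Sequence' array of shape (m, 7); a: df1.values of shape (n, 7)
    # count matching positions between every row of x and every row of a
    total = 0
    for i in nb.prange(x.shape[0]):
        for j in range(a.shape[0]):
            total += (x[i] == a[j]).sum()
    # normalise by the total number of compared positions
    return total / (x.shape[0] * a.size)

vals = df1.values
df2['Overlap'] = df2['Sequence'].apply(lambda x: overlap_vs_rows(x, vals))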
eroot163pi
  • Thanks for your reply, very helpful! What if the lengths of the array of arrays differ? Let's say df1 has (50,7) and df2 has (100,7)? – alex22200 Aug 22 '21 at 13:13
  • You can write a function using normal for loops and use the numba jit decorator on that function; there will be no decrease in speed, as numba also has an efficient SIMD vectorizer in the background – eroot163pi Aug 22 '21 at 13:29
  • thanks for the reply. Would you be so kind as to provide an example? Thanks! – alex22200 Aug 22 '21 at 13:30
  • Look at this question https://stackoverflow.com/questions/68591676/why-are-np-hypot-and-np-subtract-outer-very-fast, it uses a normal for loop; look at the answer as well, there are two for loops and a numba decorator. Just follow the same method – eroot163pi Aug 22 '21 at 13:32
  • @alex22200 I added an example in the comments of the code above; the logic is not correct, just an illustration, you can alter it to your needs. – eroot163pi Aug 22 '21 at 13:48
2

You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:

df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)

output:

0    0.270000
1    0.298571

To save the output in your original dataframe:

df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
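If every entry of 'Sequence' has the same shape as df1 (as in the example data), a possible fully vectorized alternative (a sketch, not part of the answer) stacks the sequences into a single 3D array and lets numpy broadcasting do all comparisons at once:

import numpy as np

# one (n_sequences, 100, 7) array; assumes every sequence shares df1's shape
seqs = np.stack(df2['Sequence'].tolist())

# broadcast df1.values (100, 7) against every sequence and average the matches
df2['Overlap'] = (seqs == df1.values).mean(axis=(1, 2))

Note that for 500,000 sequences both the stacked integer array and the boolean comparison run into the gigabyte range, so processing df2 in chunks may be necessary.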
mozway