2

I want know what cause this performance issue.

Issue: CPU 100% and take several hours to finish.

df1.size : 2.5m

df2.size : 264k

df1:

Index B
A1  B1
A1  B3
A2  B2
...
A3  B1
A3  B4
A4  B7
A5  B3

df2:

Index C
A1  C1
A1  C2
A2  C1
...
A3  C1
A3  C2
A4  C1

I want use the index of df2 (NOT unique) to match the same value in index of df1 (NOT unique) to get permutations of B(Bx) and C(Cx)

My code:

//This operation DO NOT have performance issue
//get Intersection of Index of df1 and df2 to avoid Exception
Index_df2_deduplicated = df2.index.drop_duplicates()
Index_FullList = []
for i in range(Index_df2_deduplicated.size):
    Index_FullList.append(Index_df2_deduplicated[i])
IntersectionIndexs = df1.index.intersection(Index_FullList).drop_duplicates()

//This cause CPU 100% and take several hours to finish.
i = 0
for i in range(IntersectionIndexs.size):
    B = df1.loc[IntersectionIndexs[i],'B']
    C = df2.loc[IntersectionIndexs[i],'C']
    if isinstance(B, (unicode)) == True:
        B = [B]
    elif isinstance(B, (pd.core.series.Series)) == True:
        B = B.drop_duplicates().reset_index(drop=True).tolist()
    if isinstance(C, (unicode)) == True:
        C = [C]
    elif isinstance(C, (pd.core.series.Series)) == True:
        C = C.drop_duplicates().reset_index(drop=True).tolist()
    lists = [B, C]
    Output = pd.DataFrame(list(itertools.product(*lists)), columns=['B', 'C'])
    Output.to_csv("output.txt", mode='a', index=False, header=False)

1 Answers1

1

Pandas is shipped with Cartesian join ability:

pd.merge(df1, df2, on="key")

Your code involves creating too many unnecessary object explicitly. Moreover, operations on such objects can be VERY costly in Python as Python is a dynamically typed language, especially for iteration and merging.

Some immediate advice:

  • Don't reinvent the wheel. Use library-provided functions as much as possible because they are optimized to a reasonable degree by those experienced library authors.
  • ALWAYS ask StackOverflow or google first if you feel your task belongs to some sort of generalizable logic. For example, search "Cartesian join pandas" on google and you are likely to find this answer, which would save you tons of time.

N.B. The data cleansing part is ignored, as it is not the main course of this question.

Code

import pandas as pd
# data
df1 = pd.DataFrame(
    data={
        "Index": ['a1', 'a1', 'a2', 'a3', 'a3', 'a3',],
        "B": [1,2,3,4,5,6]
    }
)
df2 = pd.DataFrame(
    data={
        "Index": ['a1', 'a1', 'a2', 'a2', 'a3', 'a3'],
        "C": [.1,.2,.3,.4,.5,.6]
    }
)
df1 = df1.set_index("Index")
df2 = df2.set_index("Index")

# task
df_ans = pd.merge(df1, df2, on="Index")

Output

df

# as eapected: 2*2 a1's + 1*2 a2's + 3*2 a3's in order
Out[5]: 
       B    C
Index        
a1     1  0.1
a1     1  0.2
a1     2  0.1
a1     2  0.2
a2     3  0.3
a2     3  0.4
a3     4  0.5
a3     4  0.6
a3     5  0.5
a3     5  0.6
a3     6  0.5
a3     6  0.6
Bill Huang
  • 4,491
  • 2
  • 13
  • 31