I want to know what causes this performance issue.
Issue: the CPU sits at 100% and the script takes several hours to finish.
df1.size: 2.5 million
df2.size: 264 thousand
df1:
Index B
A1 B1
A1 B3
A2 B2
...
A3 B1
A3 B4
A4 B7
A5 B3
df2:
Index C
A1 C1
A1 C2
A2 C1
...
A3 C1
A3 C2
A4 C1
I want to use the index of df2 (NOT unique) to match the same values in the index of df1 (NOT unique), and for each matching index value get every combination (Cartesian product) of its B values (Bx) and its C values (Cx).
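To make the goal concrete, here is a small sketch for a single index value. It uses only the sample rows shown above for A1 (B1 and B3 in df1, C1 and C2 in df2); the variable names are just for illustration.

```python
import itertools

# Values copied from the sample rows for index A1
b_values = ["B1", "B3"]   # column B of df1 where the index is A1
c_values = ["C1", "C2"]   # column C of df2 where the index is A1

# Every (B, C) pair for this one index value
pairs = list(itertools.product(b_values, c_values))
print(pairs)
# [('B1', 'C1'), ('B1', 'C2'), ('B3', 'C1'), ('B3', 'C2')]
```

The real output should contain these pairs for every index value that appears in both frames.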
My code:
import itertools
import pandas as pd

# This operation does NOT have a performance issue.
# Get the intersection of the indexes of df1 and df2 to avoid an exception.
Index_df2_deduplicated = df2.index.drop_duplicates()
Index_FullList = []
for i in range(Index_df2_deduplicated.size):
    Index_FullList.append(Index_df2_deduplicated[i])
IntersectionIndexs = df1.index.intersection(Index_FullList).drop_duplicates()
# This causes CPU 100% and takes several hours to finish.
for i in range(IntersectionIndexs.size):
    # .loc returns a scalar when the label matches one row, a Series when it matches several
    B = df1.loc[IntersectionIndexs[i], 'B']
    C = df2.loc[IntersectionIndexs[i], 'C']
    # unicode is the Python 2 text type
    if isinstance(B, unicode):
        B = [B]
    elif isinstance(B, pd.Series):
        B = B.drop_duplicates().reset_index(drop=True).tolist()
    if isinstance(C, unicode):
        C = [C]
    elif isinstance(C, pd.Series):
        C = C.drop_duplicates().reset_index(drop=True).tolist()
    lists = [B, C]
    # Build every (B, C) pair for this index value and append it to the output file
    Output = pd.DataFrame(list(itertools.product(*lists)), columns=['B', 'C'])
    Output.to_csv("output.txt", mode='a', index=False, header=False)
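One direction that might avoid the loop, offered as an unbenchmarked sketch rather than a confirmed fix: the loop does one label lookup per key on a non-unique index and appends to the CSV on every iteration, which is probably where the time goes. Assuming both frames and the joined result fit in memory, a single inner merge on the index values should yield the same (B, C) pairs. The helper name pair_b_and_c, the column name key, and the tiny sample frames are hypothetical and only for illustration.

```python
import pandas as pd

def pair_b_and_c(df1, df2):
    """Return every (B, C) pair for index values present in both frames."""
    # Turn the index into an ordinary column named "key" (hypothetical name)
    # and drop repeated (key, value) rows so each pair appears once per key.
    b = df1[["B"]].rename_axis("key").reset_index().drop_duplicates()
    c = df2[["C"]].rename_axis("key").reset_index().drop_duplicates()
    # An inner merge on a non-unique key yields the Cartesian product per key.
    return b.merge(c, on="key", how="inner")[["B", "C"]]

# Tiny made-up sample in the same shape as the real data
df1 = pd.DataFrame({"B": ["B1", "B3", "B2", "B1"]}, index=["A1", "A1", "A2", "A9"])
df2 = pd.DataFrame({"C": ["C1", "C2", "C1"]}, index=["A1", "A1", "A2"])

pairs = pair_b_and_c(df1, df2)
print(pairs)
```

The result could then be written once with pairs.to_csv("output.txt", index=False, header=False) instead of appending inside a loop. Row order may differ from the loop version, and keys with many duplicates on both sides can make the merged frame large.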