I have two dataframes:
import pandas as pd

df1 = pd.DataFrame({'a': [1.5, 2.5], 'b': [0.25, 2.75], 'c': [1.25, 0.75], 'd': [1.5, 2.5], 'e': [0.25, 2.75], 'f': [1.25, 0.75]})
df2 = pd.DataFrame({'a': [1.5, 2.5, 3.5, 4.5], 'b': [0.25, 1.5, 2.5, 2.75], 'c': [1.25, 0.75, 3.5, 4.5], 'd': [1.5, 2.5, 3.5, 4.5], 'e': [0.25, 2.75, 1.5, 3.5], 'f': [1.25, 0.75, 2.5, 4.5]})
For every row in df1, I want to compute the distance between that row and every row of df2, using only a chosen subset of columns. I then want the minimum of those distances for each df1 row, and to return the 'e' value of the df2 row that achieves it.
For example, if I pass columns a and b, then for each row of df1 I want the distance over a and b to every row in df2, the smallest of those distances, and the corresponding 'e' value from df2.
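To make the goal concrete, here is a minimal sketch of the computation for just the first row of df1, using columns a and b and the Euclidean distance (L = 2). The variable names (`cols`, `row`, `dists`, `nearest`, `wanted_e`) are only for illustration, and the frames are trimmed to the columns this example needs.

```python
import numpy as np
import pandas as pd

# Same values as above, restricted to the columns used here.
df1 = pd.DataFrame({'a': [1.5, 2.5], 'b': [0.25, 2.75]})
df2 = pd.DataFrame({'a': [1.5, 2.5, 3.5, 4.5],
                    'b': [0.25, 1.5, 2.5, 2.75],
                    'e': [0.25, 2.75, 1.5, 3.5]})

cols = ['a', 'b']

# First row of df1 as a plain array: [1.5, 0.25]
row = df1.loc[0, cols].to_numpy(dtype=float)

# Difference between that row and every row of df2, shape (4, 2)
diffs = df2[cols].to_numpy(dtype=float) - row

# Euclidean distance to each df2 row, shape (4,)
dists = np.sqrt((diffs ** 2).sum(axis=1))

# Position of the closest df2 row, and its 'e' value
nearest = int(dists.argmin())
wanted_e = df2['e'].iloc[nearest]

print(nearest, wanted_e)
```

For the first row of df1 the closest df2 row is row 0 (the distance is exactly zero), so the value I want back is 0.25. Doing the same for the second row of df1 should give 1.5, so the full result I expect for this sample is [0.25, 1.5].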
I am using the following two functions:
import time

import numpy as np


def distance(x1, x2, L):
    # Minkowski-style distance of order L between two rows
    start_time = time.time()
    dist = (np.sum((np.array(x1) - np.array(x2)) ** L)) ** (1 / float(L))
    print("Time taken: " + str(round(time.time() - start_time, 2)) + " seconds")
    return dist


def mindistance(data1, data2, variables, L):
    start_time = time.time()
    pred_values = []
    test1 = []
    # For each row of data2, compare against every row of data1
    for index2, row2 in data2.iterrows():
        test = []
        for index1, row1 in data1.iterrows():
            a = distance(row2[variables], row1[variables], L)
            test.append(a)
        #print(test)
        # Position of the closest data1 row, then its 'e' value
        index = test.index(min(test))
        #print(index)
        b = round(data1['e'].iloc[index], 2)
        pred_values.append(b)
    print(pred_values)
    print(len(pred_values))
    return "Time taken: " + str(round(time.time() - start_time, 2)) + " seconds"


print(mindistance(df2, df1, ['a', 'b'], 2))
These functions work correctly, but they are very inefficient, and the distance computation takes a long time. With around 60000 iterations in total on my original dataframes, it takes more than one minute. Stepping through line by line, most of the time is spent on the line a=distance(row2[variables],row1[variables],L). Can anybody help me make this code more efficient?