I have a dataframe of 80k rows with numerical and categorical data. I "train" a KNN using the Gower distance on only the 1k rows that have a "to_predict" value, and then I want to assign each of the remaining 79k rows to its nearest neighbour in that model, in order to fill in their "to_predict" value (which is originally null).
In R I can do this in reasonable time, but in Python it seems to run forever:
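import random
import numpy as np
import pandas as pd
import gower

knnsize = 1000
# Sample the rows used as the KNN "training" set
data_knn1 = data.iloc[random.sample(range(len(data)), knnsize), :]
data_knn1_no = data_knn1.drop(['to_predict'], axis=1)
data_knn1_with = data_knn1[['to_predict']]
# Rows whose "to_predict" value is missing
data_knn2_no = data.loc[data['to_predict'].isna()]
data_knn2_no = data_knn2_no.drop(['to_predict'], axis=1)
# Stack both sets and compute the full pairwise Gower distance matrix
data_gower = pd.concat([data_knn1_no, data_knn2_no])
dist_matrix = gower.gower_matrix(np.asarray(data_gower))
# Keep only the block: training rows (rows) vs. rows to predict (columns)
dist_matrix = pd.DataFrame(dist_matrix[:knnsize, knnsize:])
# For each row to predict, find the closest training row
indknn = []
for j in range(len(dist_matrix.columns)):
    indknn.append(dist_matrix.iloc[:, j].idxmin())
# Copy the "to_predict" value of the nearest training row
new_data = data_knn1_with.iloc[indknn, :]
new_data = new_data[['to_predict']]
data.loc[data['to_predict'].isna(), ['to_predict']] = new_data.to_numpy()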
I guess it is because I am converting a pandas structure to a NumPy array and then iterating over the array (cache misses).
Is there any way to do this directly in Python on a pandas DataFrame, or any other way to do it efficiently?
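For reference, here is a minimal sketch of the kind of thing I am hoping for. It only compares the rows to predict against the labelled rows, in chunks, so the full 80k x 80k matrix is never built. The function name `gower_nearest`, the `chunk_size` parameter and the toy data are made up for illustration, it assumes there are no missing values in the feature columns, and it uses a hand-written Gower distance (range-scaled absolute difference for numeric columns, 0/1 mismatch for categorical ones) rather than the gower package. If I read that package correctly, `gower_matrix` also takes a second `data_y` argument that might do the same job, but I have not checked that here.

```python
import numpy as np
import pandas as pd


def gower_nearest(labeled, unlabeled, chunk_size=2000):
    """For each row of `unlabeled`, return the position of the nearest
    row of `labeled` under a simple Gower distance (hypothetical helper)."""
    num_cols = labeled.select_dtypes(include="number").columns
    cat_cols = labeled.columns.difference(num_cols)

    # Numeric features: scale by the range observed over both sets.
    both = pd.concat([labeled[num_cols], unlabeled[num_cols]])
    rng = (both.max() - both.min()).to_numpy(dtype=float)
    rng[rng == 0] = 1.0  # avoid dividing by zero for constant columns
    lab_num = labeled[num_cols].to_numpy(dtype=float) / rng
    unl_num = unlabeled[num_cols].to_numpy(dtype=float) / rng

    # Categorical features: integer codes shared between both sets.
    lab_cat = np.empty((len(labeled), len(cat_cols)), dtype=np.int64)
    unl_cat = np.empty((len(unlabeled), len(cat_cols)), dtype=np.int64)
    for k, col in enumerate(cat_cols):
        codes, _ = pd.factorize(pd.concat([labeled[col], unlabeled[col]]))
        lab_cat[:, k] = codes[:len(labeled)]
        unl_cat[:, k] = codes[len(labeled):]

    n_features = len(num_cols) + len(cat_cols)
    nearest = np.empty(len(unlabeled), dtype=np.int64)
    for start in range(0, len(unlabeled), chunk_size):
        stop = start + chunk_size
        n_chunk = unl_num[start:stop].shape[0]
        # Distance block of shape (rows in chunk, labelled rows).
        dist = np.zeros((n_chunk, len(labeled)))
        for k in range(len(num_cols)):
            dist += np.abs(unl_num[start:stop, k, None] - lab_num[None, :, k])
        for k in range(len(cat_cols)):
            dist += unl_cat[start:stop, k, None] != lab_cat[None, :, k]
        dist /= n_features
        nearest[start:stop] = dist.argmin(axis=1)
    return nearest


# Toy data standing in for the real 80k-row frame (values are made up).
data = pd.DataFrame({
    "x": [0.0, 10.0, 1.0, 9.0, 4.0],
    "c": ["a", "b", "a", "b", "b"],
    "to_predict": ["low", "high", None, None, None],
})
labeled = data[data["to_predict"].notna()]
unlabeled = data[data["to_predict"].isna()]
features = data.columns.drop("to_predict")

idx = gower_nearest(labeled[features], unlabeled[features])
# Copy the label of the nearest labelled row into each missing slot.
data.loc[unlabeled.index, "to_predict"] = labeled["to_predict"].to_numpy()[idx]
```

With 1k labelled rows and a chunk of 2000 rows, each distance block holds about 2 million floats, so memory stays small whatever the number of rows to predict.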
Thanks in advance