I have a DataFrame of 80k rows with numerical and categorical data. I "train" a KNN with the Gower distance on only 1k rows (the rows that have a "to_predict" value), and then I want to assign each of the remaining 79k rows to its nearest neighbour in that KNN model, in order to fill in their "to_predict" value, which is originally null.

In R I am able to do this in reasonable time, but in Python it never seems to finish:

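import random

import gower
import numpy as np
import pandas as pd

# Reference set: a sample of the rows that already have a "to_predict" value
knnsize = 1000
data_known = data.loc[data['to_predict'].notna()]
data_knn1 = data_known.iloc[random.sample(range(0, len(data_known)), knnsize), :]

data_knn1_no = data_knn1.drop(['to_predict'], axis=1)
data_knn1_with = data_knn1[['to_predict']]

# Rows whose "to_predict" value is missing
data_knn2_no = data.loc[data['to_predict'].isna()]
data_knn2_no = data_knn2_no.drop(['to_predict'], axis=1)

# Stack both sets and compute the full pairwise Gower distance matrix
data_gower = pd.concat([data_knn1_no, data_knn2_no])
dist_matrix = pd.DataFrame(gower.gower_matrix(np.asarray(data_gower)))

# Keep only the distances from reference rows (rows) to missing rows (columns)
dist_matrix = dist_matrix.iloc[:knnsize, knnsize:]

# For every missing row, find the position of the closest reference row
indknn = []
for j in range(0, len(dist_matrix.columns)):
    indknn.append(np.argmin(dist_matrix.iloc[:, j]))

# Copy the "to_predict" value of that nearest reference row
new_data = data_knn1_with.iloc[indknn, :]
new_data = new_data[['to_predict']]

data.loc[data['to_predict'].isna(), ['to_predict']] = new_data.to_numpy()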

I guess it is because I am converting a pandas structure to a NumPy array and then iterating over that array (cache misses).
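A rough size estimate suggests another possible cause besides cache misses: the code computes the distance between every pair of stacked rows, although only the distances between the 1k reference rows and the 79k incomplete rows are ever used. The arithmetic below is a back-of-the-envelope sketch; the 4 bytes per value is an assumption (32-bit floats), and the variable names are made up for illustration.

```python
# Back-of-the-envelope memory estimate for the distance matrix.
n_reference = 1_000   # rows with a known "to_predict" value
n_missing = 79_000    # rows whose "to_predict" value must be filled in
bytes_per_value = 4   # assumption: distances are stored as 32-bit floats

n_total = n_reference + n_missing

# Full pairwise matrix over all stacked rows (what the code above computes).
full_matrix_gb = n_total ** 2 * bytes_per_value / 1e9

# Only the reference-by-missing block is needed for the nearest-neighbour lookup.
needed_block_gb = n_reference * n_missing * bytes_per_value / 1e9

print(f"full matrix:  {full_matrix_gb:.1f} GB")
print(f"needed block: {needed_block_gb:.3f} GB")
```

Under these assumptions the full matrix is roughly 80 times larger than the block that is actually needed.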

Is there any way to do this directly in Python on a pandas DataFrame, or any other way to do it efficiently?
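One possible direction, sketched here under several assumptions, is to compute only the distances between each incomplete row and the reference rows, in chunks, using NumPy broadcasting instead of a Python loop over columns. The helper `nearest_reference` and its `chunk_size` parameter are hypothetical names, not part of any library. The Gower distance is written out by hand (range-scaled absolute difference for numeric columns, simple mismatch for categorical ones, equal weights), and it assumes there are no missing values in the feature columns.

```python
import numpy as np
import pandas as pd


def nearest_reference(reference, query, chunk_size=5000):
    """Position of the closest `reference` row for each `query` row.

    Hypothetical helper: hand-written Gower distance with equal weights,
    assuming the feature columns contain no missing values.
    """
    num_cols = [c for c in reference.columns
                if pd.api.types.is_numeric_dtype(reference[c])]
    cat_cols = [c for c in reference.columns if c not in num_cols]

    ref_num = reference[num_cols].to_numpy(dtype=float)
    ref_cat = reference[cat_cols].to_numpy(dtype=object)
    qry_num = query[num_cols].to_numpy(dtype=float)
    qry_cat = query[cat_cols].to_numpy(dtype=object)

    # Column ranges over both sets, so numeric differences fall in [0, 1].
    stacked = np.vstack([ref_num, qry_num])
    ranges = stacked.max(axis=0) - stacked.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute 0

    n_features = len(num_cols) + len(cat_cols)
    nearest = np.empty(len(query), dtype=np.intp)

    for start in range(0, len(query), chunk_size):
        stop = min(start + chunk_size, len(query))
        # dist has shape (rows in chunk, reference rows)
        dist = np.zeros((stop - start, len(reference)))
        for k in range(len(num_cols)):
            diff = qry_num[start:stop, k, None] - ref_num[None, :, k]
            dist += np.abs(diff) / ranges[k]
        for k in range(len(cat_cols)):
            dist += qry_cat[start:stop, k, None] != ref_cat[None, :, k]
        nearest[start:stop] = (dist / n_features).argmin(axis=1)
    return nearest


# Toy data standing in for the real 80k-row DataFrame.
data = pd.DataFrame({
    "age": [20, 60, 22, 58],
    "city": ["A", "B", "A", "B"],
    "to_predict": [1.0, 2.0, np.nan, np.nan],
})
missing = data["to_predict"].isna()
features = data.drop(columns="to_predict")

idx = nearest_reference(features[~missing], features[missing])
data.loc[missing, "to_predict"] = data.loc[~missing, "to_predict"].to_numpy()[idx]
```

With 1k reference rows and a chunk of 5000 rows, each chunk's distance block holds 5 million values, so memory stays small regardless of how many rows need filling. The `gower` package's `gower_matrix` also appears to accept a second `data_y` argument for computing a rectangular block; the sketch avoids it only to stay dependency-free, and its results may differ slightly from the package's normalization.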

Thanks in advance

alvarella
  • Could you provide a small DataFrame example? See https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Guillermo Mosse May 15 '20 at 10:29
  • Hi, the program works well and does not break; the problem is performance when the dataset is "big". I can provide an example if it helps, but the question is more about finding an efficient function – alvarella May 18 '20 at 07:50

0 Answers