My goal is to compute customer similarity based on Euclidean distance and find the 5 most similar customers for each customer.
I have data for 400,000 customers, each with 40 attributes. The DataFrame looks like:
A1 A2 ... A40
0 xx xx ... xx
1 xx xx ... xx
2 xx xx ... xx
... ...
399,999 xx xx ... xx
I first standardize the data with sklearn's StandardScaler, which gives me the processed array X_data.
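(For reference, the standardization step is roughly the following sketch; df is just a placeholder name for the raw DataFrame:)

from sklearn.preprocessing import StandardScaler

# df is the raw 400,000 x 40 customer DataFrame (name assumed for illustration)
scaler = StandardScaler()
X_data = scaler.fit_transform(df)  # numpy array of shape (400000, 40), each column scaled to zero mean and unit variance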
So now we have 400,000 customers (points/vectors), each with 40 dimensions. So far so good.
I then use dis = numpy.linalg.norm(a - b) to calculate the distance between each pair of points. The shorter the distance, the more similar the two customers are.
My plan was to calculate the 5 most similar customers for each customer and then combine the results. To try this out, I started with customer0. But it is already too slow for just this one customer. Even if I reduce the 40 dimensions to 2 with PCA from sklearn.decomposition, it is still too slow.
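(The PCA experiment was roughly the sketch below; X_data_2d is just my name for the reduced array:)

from sklearn.decomposition import PCA

pca = PCA(n_components=2)              # keep only 2 principal components
X_data_2d = pca.fit_transform(X_data)  # shape (400000, 2), used in place of X_data in the loop

Here is the code I wrote for customer0, shown on the full 40-dimensional X_data: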
import numpy
import pandas as pd

result = pd.DataFrame(columns=['index1', 'index2', 'distance'])
for i in range(len(X_data)):
    # Euclidean distance between customer0 and customer i
    dis = numpy.linalg.norm(X_data[0] - X_data[i])
    result.loc[len(result)] = [0, i, dis]
result = result.sort_values(by=['distance'])
result = result[1:6]  # pick the 5 nearest customers, skipping row 0, which is customer0 itself with distance 0
The result looks like this; it shows the 5 most similar customers of customer0:
index1 index2 distance
0 0 206391 0.004
1 0 314234 0.006
2 0 89284 0.007
3 0 124826 0.012
4 0 234513 0.013
So to get the result for all 400,000 customers, I could just wrap this loop in another for loop over every customer, roughly as sketched below. But the problem is that it is already this slow when I compute the 5 most similar customers for customer0 alone, not to mention for all the customers. What should I do to make it faster? Any ideas?
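For completeness, the brute-force extension I had in mind is roughly the nested loop below (only a sketch of the plan, reusing the imports from the loop above; all_results and result_all are just placeholder names):

all_results = []
for j in range(len(X_data)):                      # outer loop: one pass per customer
    dists = pd.DataFrame(columns=['index1', 'index2', 'distance'])
    for i in range(len(X_data)):                  # inner loop: distance to every other customer
        dis = numpy.linalg.norm(X_data[j] - X_data[i])
        dists.loc[len(dists)] = [j, i, dis]
    dists = dists.sort_values(by=['distance'])
    all_results.append(dists[1:6])                # keep the 5 nearest, skipping the customer itself
result_all = pd.concat(all_results, ignore_index=True)

That is 400,000 × 400,000 distance calculations, which is why I am asking for a faster approach.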