I have a CSV file with the content shown below, and I want to calculate the cosine similarity between each ID and all the remaining IDs in the file.
I have loaded it into a pandas DataFrame as follows:
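
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Parse each vector string, e.g. "[0.1 -0.2 0.3]", into a flat NumPy array
    old_df['Vector'] = old_df['Vector'].apply(
        lambda s: np.array(np.matrix(s)).ravel())
    # Stack the per-row arrays into one (n_rows, n_dims) matrix
    A = np.array(old_df['Vector'].tolist())
    # similarities[i, j] is the cosine similarity between rows i and j
    similarities = cosine_similarity(A)
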
The output looks fine. However, I do not know how to find which GUID (or ID) is similar to which other GUID (or ID), and I only want the top k with the highest similarity scores.
Could you please help me solve this issue?
Thank you.
|Index | GUID | Vector |
|-------|-------|---------------------------------------|
|36099 | b770 |[-0.04870541 -0.02133574 0.03180726] |
|36098 | 808f |[ 0.0732905 -0.05331331 0.06378368] |
|36097 | b111 |[ 0.01994788 0.00417582 -0.09615131] |
|36096 | b6b5 |[0.025697 -0.08277534 -0.0124591] |
|36083 | 9b07 |[ 0.025697 -0.08277534 -0.0124591] |
|36082 | b9ed |[-0.00952298 0.06188576 -0.02636449] |
|36081 | a5b6 |[0.00432161 0.02264584 -0.0341924] |
|36080 | 9891 |[ 0.08732156 0.00649456 -0.02014138] |
|36079 | ba40 |[0.05407356 -0.09085554 -0.07671648] |
|36078 | 9dff |[-0.09859556 0.04498474 -0.01839088] |
|36077 | a423 |[-0.06124249 0.06774347 -0.05234318] |
|36076 | 81c4 |[0.07278682 -0.10460124 -0.06572364] |
|36075 | 9f88 |[0.09830415 0.05489364 -0.03916228] |
|36074 | adb8 |[0.03149953 -0.00486591 0.01380711] |
|36073 | 9765 |[0.00673934 0.0513557 -0.09584251] |
|36072 | aff4 |[-0.00097896 0.0022945 0.01643319] |
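
A minimal sketch of one way to map the similarity matrix back to GUIDs and keep only the top k matches per row. It assumes the `Vector` column has already been parsed into NumPy arrays; the function name `top_k_similar`, its parameters, and the output column names `similar_GUID` and `score` are hypothetical and chosen only for illustration. The small `sample` frame reuses a few rows from the table above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def top_k_similar(df, k=3, id_col="GUID", vec_col="Vector"):
    """Return one row per (GUID, neighbour) pair, keeping the k best neighbours.

    Hypothetical helper: assumes df[vec_col] holds equal-length 1-D arrays.
    """
    ids = df[id_col].to_numpy()
    A = np.vstack(df[vec_col].tolist())        # shape (n_rows, n_dims)
    sim = cosine_similarity(A)                 # shape (n_rows, n_rows)
    np.fill_diagonal(sim, -np.inf)             # exclude each row's match with itself
    k = min(k, len(ids) - 1)                   # cannot return more neighbours than exist
    top = np.argsort(-sim, axis=1)[:, :k]      # column indices of the k largest scores
    return pd.DataFrame({
        "GUID": np.repeat(ids, k),
        "similar_GUID": ids[top].ravel(),
        "score": np.take_along_axis(sim, top, axis=1).ravel(),
    })


# A few rows from the sample table, with vectors already parsed into arrays
sample = pd.DataFrame({
    "GUID": ["b770", "808f", "b6b5", "9b07", "ba40", "81c4"],
    "Vector": [
        np.array([-0.04870541, -0.02133574, 0.03180726]),
        np.array([0.0732905, -0.05331331, 0.06378368]),
        np.array([0.025697, -0.08277534, -0.0124591]),
        np.array([0.025697, -0.08277534, -0.0124591]),
        np.array([0.05407356, -0.09085554, -0.07671648]),
        np.array([0.07278682, -0.10460124, -0.06572364]),
    ],
})

result = top_k_similar(sample, k=2)
print(result)
```

Each GUID appears k times in the result, with its neighbours ordered from most to least similar. The full `argsort` is simple but sorts every row completely; for a large file, `np.argpartition` may be a cheaper way to select the k largest entries before sorting just those.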