
I calculated the cosine similarity of a dataframe similar to the following:

ciiu4n4  A0111  A0112  A0113   
 A0111      14      7      6 
 A0112      16     55      3 
 A0113      15      0    112 

using this code:

from sklearn.metrics.pairwise import cosine_similarity

data_cosine = mpg_data.drop(['ciiu4n4'], axis=1)  # keep only the numeric columns
result = cosine_similarity(data_cosine)

The result is an array like this:

[[ 1.          0.95357118  0.95814892 ]
 [ 0.95357118  1.          0.89993795 ]
 [ 0.95814892  0.89993795  1.         ]]

However, I need the result as a dataframe similar to the original one. I can't do it manually, because the original dataframe is 600 x 600.

The result I need should look something like this:

ciiu4n4   A0111        A0112        A0113       
 A0111    1.           0.95357118   0.95814892
 A0112    0.95357118   1.           0.89993795
 A0113    0.95814892   0.89993795   1.  
cs95
PAstudilloE

1 Answer


I'd recommend changing your approach slightly. No need to drop any columns. Instead, set the first column as the index, compute cosine similarities, and assign the result array back to the dataframe.

df = df.set_index('ciiu4n4')
df

         A0111  A0112  A0113
ciiu4n4                     
A0111       14      7      6
A0112       16     55      3
A0113       15      0    112

v = cosine_similarity(df.values)

df[:] = v
df.reset_index()

  ciiu4n4     A0111     A0112     A0113
0   A0111  1.000000  0.953571  0.958149
1   A0112  0.953571  1.000000  0.899938
2   A0113  0.958149  0.899938  1.000000

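The solution above only works when the frame is square, that is, when the number of rows equals the number of columns (excluding the first). cosine_similarity compares rows, so v always has shape (rows, rows), and df[:] = v needs v to have the same shape as df. So, here's another solution that works for any number of columns.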

df = df.set_index('ciiu4n4')
v = cosine_similarity(df.values)

df = pd.DataFrame(v, columns=df.index.values, index=df.index).reset_index()
df

  ciiu4n4     A0111     A0112     A0113
0   A0111  1.000000  0.953571  0.958149
1   A0112  0.953571  1.000000  0.899938
2   A0113  0.958149  0.899938  1.000000
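
For reference, here is a self-contained sketch of the same approach, with the imports and the sample frame from the question included so it can be run as-is. The helper name `cosine_similarity_frame` and its `label_col` parameter are made up for illustration and are not part of pandas or scikit-learn. Note that the numbers it prints for this three-row sample will differ from the values quoted in the thread, which appear to come from the asker's full dataset.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Sample data copied from the question.
df = pd.DataFrame({
    'ciiu4n4': ['A0111', 'A0112', 'A0113'],
    'A0111': [14, 16, 15],
    'A0112': [7, 55, 0],
    'A0113': [6, 3, 112],
})


def cosine_similarity_frame(frame, label_col):
    """Hypothetical helper: row-by-row cosine similarity as a labelled DataFrame."""
    indexed = frame.set_index(label_col)
    sim = cosine_similarity(indexed.values)
    # The row labels are used for both the index and the column names, so the
    # result is square no matter how many feature columns the input has.
    out = pd.DataFrame(sim, index=indexed.index, columns=indexed.index.values)
    return out.reset_index()


result = cosine_similarity_frame(df, 'ciiu4n4')
print(result)
```

Building a new frame rather than assigning into the old one also leaves the original data untouched, which may be convenient if it is needed later.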

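Or, using df.insert (starting again from the indexed df and the v computed above):

labels = df.index.values  # save the row labels before df is overwritten
df = pd.DataFrame(v, columns=labels)
df.insert(0, 'ciiu4n4', labels)
df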

  ciiu4n4     A0111     A0112     A0113
0   A0111  1.000000  0.953571  0.958149
1   A0112  0.953571  1.000000  0.899938
2   A0113  0.958149  0.899938  1.000000
cs95
  • Thanks @COLDSPEED. I'm getting an error now on `df[:] = v`: "ValueError: Must have equal len keys and value when setting with an ndarray" – PAstudilloE Jan 07 '18 at 00:12
  • @PAstudilloE The idea is that all the columns that are not involved in the calculation must be set as the index. So please do that. – cs95 Jan 07 '18 at 00:13
  • @COLDSPEED, the only column not involved in the calculation is ciiu4n4 and now it's set as index. But I'm still getting the same mistake. =( – PAstudilloE Jan 07 '18 at 00:18
  • @PAstudilloE Please print `df.shape` and `v.shape`...? – cs95 Jan 07 '18 at 00:19
  • @COLDSPEED df.shape (390, 414), v.shape (390,390) – PAstudilloE Jan 07 '18 at 00:22
  • @PAstudilloE So... you have 414 columns. That would imply 24 columns are not involved in the calculation. Am I wrong? – cs95 Jan 07 '18 at 00:23
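
The shape mismatch reported in the comments can be reproduced on a small frame. This is only a sketch: the data is randomly generated, and the column names `f1` to `f5` are invented. It shows that the similarity matrix has one row and one column per input row, so it does not fit back into a frame with a different number of columns, while a new frame labelled by the row index does.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up data: 3 labelled rows and 5 feature columns (not square).
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.integers(1, 50, size=(3, 5)),
    index=pd.Index(['A0111', 'A0112', 'A0113'], name='ciiu4n4'),
    columns=['f1', 'f2', 'f3', 'f4', 'f5'],
)

v = cosine_similarity(wide.values)
print(wide.shape, v.shape)  # shapes differ: (3, 5) versus (3, 3)

# v does not fit into wide, so wide[:] = v is not an option here.
# Build a new frame labelled by the row index instead.
sim = pd.DataFrame(v, index=wide.index, columns=wide.index.values)
print(sim.reset_index())
```

Printing `df.shape` and `v.shape` side by side, as suggested in the comments, is a quick way to tell which of the two situations applies.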