How to calculate the mean and standard deviation of similarity matrix?

Question

I am working with CSV files and I have code that calculates the similarity between documents. Post 1 provides the code and the details of the data; the data and output are as follows:

The data.csv looks as:

idx         messages
112  I have a car and it is blue
114  I have a bike and it is red
115  I don't have any car
117  I don't have any bike

The output is:

    id     112    114    115    117
    id                             
    112  100.0   78.0   51.0   50.0
    114   78.0  100.0   47.0   54.0
    115   51.0   47.0  100.0   83.0
    117   50.0   54.0   83.0  100.0

Now I would like to calculate the mean and standard deviation of the lower triangle of the similarity matrix (since the upper and lower triangles hold the same values), excluding the diagonal values (100.0).

I tried to use the pandas built-in mean and std as:

df_std = df.std()
df_Mean = df.mean()

But this uses every value in the output, including the diagonal and the upper triangle.

I would like to know if there is any way that I can calculate the mean and standard deviation the way that I mentioned.

Chris · Answer 1 · 2019-06-17T04:12:51.280

2

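Use numpy.tril with k=-1 to zero out the diagonal and the upper triangle, then keep only the nonzero entries:

import numpy as np

ltri = np.tril(df.values, -1)  # lower triangle; the diagonal and everything above it become 0

Output of np.tril:

array([[ 0.,  0.,  0.,  0.],
       [78.,  0.,  0.,  0.],
       [51., 47.,  0.,  0.],
       [50., 54., 83.,  0.]])

Then drop the zeros so they do not affect the statistics:

ltri = ltri[np.nonzero(ltri)]
# array([78., 51., 47., 50., 54., 83.])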

And now you can do ltri.std(), ltri.mean():

ltri.std(), ltri.mean()
# (14.361406616345072, 60.5)
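
One caveat with the np.nonzero filter is that it would also drop a genuine similarity of 0 between two documents. Below is a minimal sketch that selects the strictly lower triangle by position instead. It assumes the similarity matrix is rebuilt by hand from the output shown in the question; the names sim, rows, cols and lower are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Similarity matrix copied from the question's output (assumed to be all floats).
ids = [112, 114, 115, 117]
sim = pd.DataFrame(
    [[100.0, 78.0, 51.0, 50.0],
     [78.0, 100.0, 47.0, 54.0],
     [51.0, 47.0, 100.0, 83.0],
     [50.0, 54.0, 83.0, 100.0]],
    index=ids, columns=ids,
)

# Index pairs strictly below the diagonal (k=-1 excludes the diagonal itself).
rows, cols = np.tril_indices_from(sim.values, k=-1)
lower = sim.values[rows, cols]  # 1-D array of the six pairwise similarities

mean = lower.mean()
std_pop = lower.std()           # NumPy default: population std (ddof=0)
std_sample = lower.std(ddof=1)  # sample std, the default convention in pandas
```

Note that NumPy's std defaults to ddof=0 while pandas' std defaults to ddof=1, so the two libraries give different numbers for the same six values unless ddof is set explicitly.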

edited Jun 17 '19 at 04:12

answered Jun 17 '19 at 02:28

Chris


Thanks for the comment and code; I appreciate it. I have one more question. Right now the code reads one CSV file and computes the similarity between each idx. How can I compute the same similarity between 2 different documents? – Bilgin Jun 17 '19 at 02:50
@Bilgin Updated about the zeros. For the question in your comment, I suggest you either edit the current question or post another question (recommended) with some examples :) – Chris Jun 17 '19 at 04:14

score 1 · Answer 2 · answered Jun 17 '19 at 02:33

1

You can do this by masking all of the unwanted values as np.nan:

df.values[np.triu_indices_from(df.values,0)]=np.nan
df.mean()
112    59.666667
114    50.500000
115    83.000000
117          NaN
dtype: float64
df.std()
112    15.885003
114     4.949747
115          NaN
117          NaN
dtype: float64

After masking, df looks like this:

df
      112   114   115  117
112   NaN   NaN   NaN  NaN
114  78.0   NaN   NaN  NaN
115  51.0  47.0   NaN  NaN
117  50.0  54.0  83.0  NaN
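
df.mean() and df.std() on the masked frame give one value per column. For a single mean and standard deviation over the whole lower triangle, a possible sketch is shown here. It uses DataFrame.where with a boolean mask rather than writing into df.values, since in-place writes to df.values may not be reliable across pandas versions. The frame sim is rebuilt from the question's output, and keep, masked and values are illustrative names.

```python
import numpy as np
import pandas as pd

# Similarity matrix copied from the question's output.
ids = [112, 114, 115, 117]
sim = pd.DataFrame(
    [[100.0, 78.0, 51.0, 50.0],
     [78.0, 100.0, 47.0, 54.0],
     [51.0, 47.0, 100.0, 83.0],
     [50.0, 54.0, 83.0, 100.0]],
    index=ids, columns=ids,
)

# True strictly below the diagonal, False on the diagonal and above it.
keep = np.tril(np.ones(sim.shape, dtype=bool), k=-1)

# where() returns a new frame with NaN wherever keep is False; sim is not modified.
masked = sim.where(keep)

col_means = masked.mean()  # per-column means; NaN is skipped by default

# Flatten to one Series and drop the NaN placeholders to get overall statistics.
values = pd.Series(masked.to_numpy().ravel()).dropna()
overall_mean = values.mean()
overall_std = values.std()  # pandas default: sample std (ddof=1)
```

Pass ddof=0 to values.std() to match the NumPy default used in the other answer.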

answered Jun 17 '19 at 02:33

BENY


Thanks for the comment. How can the mean of the entire upper or lower triangle be calculated? Would it be something like df.mean(df.mean())? – Bilgin Jun 17 '19 at 03:04
