2

I need to find a Python function that works like this R function:

proxy::simil(method = "cosine", by_rows = FALSE) 

i.e. it builds a similarity matrix by computing the cosine similarity between every pair of dataframe rows. If NaNs are present, it should drop, for each pair, exactly those columns in which either of the two rows has a NaN.

Simil function description (R)

Python error because of NaNs

Update: I have also tried removing the NaNs for every pair of rows in a loop, using the cosine function from scipy.spatial.distance. It gives the same result as in R, but it takes ages :(
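For reference, the loop described in the update might look roughly like the sketch below. This is not the original code: the function name `cosine_simil_loop` is made up for illustration, and the cosine is computed with plain NumPy, which should be equivalent to `1 - scipy.spatial.distance.cosine(u, v)` for nonzero vectors. Pairs with no shared columns are left as NaN, which is an assumption about the desired behaviour.

```python
import numpy as np

def cosine_simil_loop(X):
    """Pairwise cosine similarity between rows, dropping for each pair
    the columns where either row is NaN (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    S = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            # Columns observed in both rows of this pair
            keep = ~np.isnan(X[i]) & ~np.isnan(X[j])
            u, v = X[i, keep], X[j, keep]
            denom = np.linalg.norm(u) * np.linalg.norm(v)
            if denom > 0:
                S[i, j] = S[j, i] = (u @ v) / denom
    return S

X = [[1, 2, np.nan],
     [2, 4, 5],
     [np.nan, 1, 0]]
S = cosine_simil_loop(X)
```

The double Python loop is what makes this slow: it does O(n²) small NumPy calls for n rows.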

Nadia

3 Answers

1

You can replace NaN with 0 and then calculate the cosine similarity.

Novak
  • Hi. I tried that, but it gives the wrong result (different from the one R's simil function gives). – Nadia Jan 28 '19 at 08:05
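A small example may show why zero-filling disagrees with pairwise deletion. With zeros in place of NaN the dot product is unchanged, but each row's norm is still taken over all of its observed values rather than only the columns shared by the pair. The vectors and the helper `cos_sim` below are made up for illustration.

```python
import numpy as np

u = np.array([1.0, 2.0, np.nan])
v = np.array([2.0, 4.0, 5.0])

def cos_sim(a, b):
    # Plain cosine similarity of two NaN-free vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Option 1: replace NaN with 0, then compute cosine similarity
zero_filled = cos_sim(np.nan_to_num(u), np.nan_to_num(v))

# Option 2: drop the columns where either vector is NaN
keep = ~np.isnan(u) & ~np.isnan(v)
pairwise = cos_sim(u[keep], v[keep])

print(zero_filled, pairwise)
```

Here the shared columns are exactly proportional, so pairwise deletion gives a similarity of 1, while zero-filling gives about 0.667 because the 5 in `v` still counts toward its norm.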
1

You can try this approach: https://github.com/Midnighter/nadist. Alternatively, you can use _chk_weights with nan_screen=True, as described by metaperture here: https://github.com/scipy/scipy/issues/3870. Hope that helps.

I found that Midnighter had previously posted the same problem on Stack Overflow: Compute the pairwise distance in scipy with missing values. There are some other solutions there but, since he went on to cythonize it, I bet they were not the best.

Artyom
1

I solved the problem by creating a mask (a boolean array indicating which values are missing) and calculating pairwise cosine distances between the row vectors of the matrix. The result was a long vector of similarities, which I then pivoted to get the similarity matrix.
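The answer does not include its code, so the following is only one possible way to use such a mask, not necessarily what was actually done. It is a sketch that builds the full similarity matrix directly with matrix products instead of a long vector that is pivoted afterwards. The name `cosine_simil_masked` is hypothetical, and here the mask marks observed (non-missing) values.

```python
import numpy as np

def cosine_simil_masked(X):
    """Cosine similarity between rows with pairwise NaN deletion,
    computed without Python loops (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)            # True where a value is observed
    Z = np.where(mask, X, 0.0)     # zero-fill so NaNs drop out of sums
    M = mask.astype(float)

    # dot[i, j]: sum of x_ik * x_jk over columns observed in both rows
    dot = Z @ Z.T
    # sq[i, j]: squared norm of row i restricted to columns observed in row j
    sq = (Z ** 2) @ M.T
    denom = np.sqrt(sq * sq.T)

    # Pairs with no shared columns (or a zero norm) become NaN
    with np.errstate(divide="ignore", invalid="ignore"):
        return dot / denom

X = [[1, 2, np.nan],
     [2, 4, 5],
     [np.nan, 1, 0]]
S_masked = cosine_simil_masked(X)
```

Because the per-pair norms come from a second matrix product, the whole computation stays inside NumPy, at the cost of holding a few n × n arrays in memory.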

Nadia