
I have been working with R packages, e.g. proxyC, that calculate sparse (cosine) similarity matrices from sparse binary matrices.

As I am now learning Python as well, and I was told it might even be faster, I would like to run the same calculations there.

I found this interesting post:

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

which describes a few methods.

I did try some of them out after writing out a small test matrix myself by hand.
Now I would like to try on 'real' data.
And that's where I encounter a problem I currently cannot solve.

My data come in TSV files that associate objects (IDs) with comma-separated lists of features (FPs). E.g.:

ID  FP
1   A,B,C
2   A,D
3   C,D,F
4   A,F
5   E,H,M

I need to convert this to a sparse binary matrix.
Even in R it took me some time to figure out the best way to do it.
I first strsplit the FP lists by comma, turning the FP column from a character vector into a list of character vectors. Then I unlist FP, repeating each ID as many times as the length of its FP vector, which gives me this:

ID  FP
1   A
1   B
1   C
2   A
2   D
3   C
3   D
3   F
4   A
4   F
5   E
5   H
5   M

And I make the sparse binary feature matrix by xtabs:

5 x 8 sparse Matrix of class "dgCMatrix"
    FP
  ID A B C D E F H M
   1 1 1 1 . . . . .
   2 1 . . 1 . . . .
   3 . . 1 1 . 1 . .
   4 1 . . . . 1 . .
   5 . . . . 1 . 1 1

I am sure it is possible to do this in Python (in this case going from the TSV file to a CSR matrix, as in the post I linked), but I am still a beginner, and I suspect it would take me a very long time to figure out all the details and get it right.

Would anyone be able to help / point me to posts describing the necessary steps with examples?

Thanks!

user6376297

1 Answer

import pandas as pd
df = pd.DataFrame({'ID':[1,2,3], 'FP':["A,B,C","A,D","C,D,F"]})

>>> df
   ID     FP
0   1  A,B,C
1   2    A,D
2   3  C,D,F
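
If the data live in a TSV file, as in the question, the same starting frame could be loaded with `pd.read_csv`. This is a minimal sketch, assuming the file has a header row with the columns ID and FP and uses tabs as separators. The name `df_tsv` is illustrative, and the in-memory `io.StringIO` buffer only stands in for a real file path.

```python
import io
import pandas as pd

# Stand-in for a real file: replace io.StringIO(tsv_text) with a path
# such as "my_data.tsv" (hypothetical file name).
tsv_text = "ID\tFP\n1\tA,B,C\n2\tA,D\n3\tC,D,F\n4\tA,F\n5\tE,H,M\n"

# sep="\t" tells pandas that the columns are tab-separated.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_tsv)
```

The resulting frame has the same two columns as the hand-built example, so the steps that follow apply unchanged.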

Split the column and explode it into a long table:

df['FP'] = df['FP'].str.split(",")
df = df.explode(column="FP")

>>> df
   ID FP
0   1  A
0   1  B
0   1  C
1   2  A
1   2  D
2   3  C
2   3  D
2   3  F

Convert the FP column to a categorical, so that each feature gets an integer code:

df['FP'] = df['FP'].astype('category')
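
To see what the categorical encoding provides, the short sketch below prints the category labels and the integer codes for the exploded FP values from the example. The variable name `fp` is illustrative only.

```python
import pandas as pd

# The exploded FP column from the example, as a standalone Series.
fp = pd.Series(["A", "B", "C", "A", "D", "C", "D", "F"]).astype("category")

# Category labels, in the order that defines the column positions.
print(list(fp.cat.categories))

# Integer code of each row's feature: an index into the categories.
print(fp.cat.codes.tolist())
```

The codes are what get used as column indices of the sparse matrix, and the categories are the matching column labels.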

Write it into a sparse matrix:

from scipy.sparse import csr_matrix
import numpy as np

mat = csr_matrix((np.ones(df.shape[0]), (df['ID'], df['FP'].cat.codes)))

>>> mat.toarray()
array([[0., 0., 0., 0., 0.],
       [1., 1., 1., 0., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 1.]])

Note the all-zero first row: the IDs start at 1, so row 0 stays empty. Make sure to keep track of which column corresponds to which categorical level. You can also encode the ID column (a good idea if the IDs are not 0-indexed integers):

df['ID'] = df['ID'].astype('category')
mat = csr_matrix((np.ones(df.shape[0]), (df['ID'].cat.codes, df['FP'].cat.codes)))

>>> mat.toarray()
array([[1., 1., 1., 0., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 1.]])

Again, keep track of your categorical levels.
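
Since the end goal in the question is a cosine similarity matrix, here is a hedged end-to-end sketch on the five-row example from the question. It is one possible approach using only NumPy and SciPy, not necessarily the method from the linked post: each row is scaled to unit length and then multiplied by the transpose. The names `raw`, `long`, `X`, `X_unit` and `sim` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, diags

raw = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "FP": ["A,B,C", "A,D", "C,D,F", "A,F", "E,H,M"]})

# Long table: one row per (ID, feature) pair.
long = raw.assign(FP=raw["FP"].str.split(",")).explode("FP", ignore_index=True)
ids = long["ID"].astype("category")
fps = long["FP"].astype("category")

# Binary feature matrix: rows follow ids.cat.categories,
# columns follow fps.cat.categories.
X = csr_matrix(
    (np.ones(len(long)), (ids.cat.codes.to_numpy(), fps.cat.codes.to_numpy())),
    shape=(len(ids.cat.categories), len(fps.cat.categories)),
)

# Euclidean norm of each row. Every encoded ID has at least one feature,
# so no norm is zero here.
norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

# Scale rows to unit length; the product is then the cosine similarity.
X_unit = diags(1.0 / norms) @ X
sim = X_unit @ X_unit.T  # sparse 5 x 5 matrix

print(np.round(sim.toarray(), 3))
```

Row and column i of `sim` correspond to `ids.cat.categories[i]`, which is how the similarity values can be mapped back to the original IDs.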

CJR
  • Thanks! I confirm that it would have taken me ages to work this out. I definitely need to convert the ID column to a category too, otherwise I get an unwanted row of 0s. To make sure I don't scramble the similarity matrix, I will need to sort the initial data frame by ascending ID (that was the case in R too, BTW). I still have a doubt: shouldn't I set the dtype of the ones to an integer, since it can never be a float, or even to a boolean, considering that I only care about the presence of a feature, not how many times it appears in a single ID? That might also speed up the similarity calculation. – user6376297 Apr 19 '21 at 18:05
  • You can set the dtype to whatever you want (e.g. `np.ones(df.shape[0], dtype=bool)`). You can also always get the order of the categorical factors (e.g. `df['FP'].cat.categories` gives you the axis labels for the columns). You could keep track of everything in a sparse dataframe as well, but it's more limited for math (e.g.: `df_sparse = pd.DataFrame.sparse.from_spmatrix(mat, index = df['ID'].cat.categories, columns = df['FP'].cat.categories)`) – CJR Apr 19 '21 at 19:09
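
The points raised in the comments can be sketched together as follows. When a sparse matrix is built from (row, column) pairs, SciPy sums duplicate entries, so a feature listed twice for one ID would produce a 2 rather than a 1. Dropping duplicate pairs first keeps the matrix strictly binary. The toy data and the names `raw`, `long` and `labeled` are made up for illustration, and the `np.int8` dtype is just one possible choice.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long table in which the pair (10, "A") appears twice.
raw = pd.DataFrame({"ID": [10, 10, 10, 20], "FP": ["A", "A", "B", "B"]})

# Keep each (ID, FP) pair once, so every matrix entry is 0 or 1.
long = raw.drop_duplicates().reset_index(drop=True)
ids = long["ID"].astype("category")
fps = long["FP"].astype("category")

# A small integer dtype is enough for presence/absence data.
mat = csr_matrix(
    (np.ones(len(long), dtype=np.int8),
     (ids.cat.codes.to_numpy(), fps.cat.codes.to_numpy())),
    shape=(len(ids.cat.categories), len(fps.cat.categories)),
)

# Labeled view of the same matrix: the categories supply the axis labels.
labeled = pd.DataFrame.sparse.from_spmatrix(
    mat, index=ids.cat.categories, columns=fps.cat.categories
)

print(mat.toarray())
print(labeled)
```

Because the row labels come from `ids.cat.categories`, the original IDs stay attached to the rows even when they are not 0-indexed integers.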