5

I have an item-item matrix (1877 x 1877). The values in the matrix represent the number of times two items occurred together. How can I determine the similarity between two items? From my reading, I found a few options, but I am not sure about these approaches. Any input to get started is appreciated.

  1. Use cosine similarity to compute the similarity between two item vectors
  2. Turn this into a graph and use a measure like SimRank to compute similarity, possibly using the co-occurrence count as the edge weight between two nodes.
kitchenprinzessin

3 Answers

3

I would recommend using cosine similarity between the item vectors. Alternatively, you could calculate the Jaccard similarity for each item pair.
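To make the two options concrete, here is a minimal sketch using only NumPy. The 4 x 4 matrix `C` and the helper names `cosine_similarity_rows` and `jaccard_similarity_rows` are made up for illustration, and the Jaccard variant assumes you binarise the counts first (an item's "set" is the set of items it co-occurs with at least once), which is one common choice rather than the only one.

```python
import numpy as np

# Hypothetical 4 x 4 co-occurrence counts (symmetric, zero diagonal).
C = np.array([
    [0, 4, 1, 0],
    [4, 0, 2, 0],
    [1, 2, 0, 3],
    [0, 0, 3, 0],
], dtype=float)


def cosine_similarity_rows(M):
    """Cosine similarity between every pair of rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # items that never co-occur keep an all-zero row
    unit = M / norms
    return unit @ unit.T


def jaccard_similarity_rows(M):
    """Jaccard similarity on binarised rows (assumption: count > 0 means 'member')."""
    B = (M > 0).astype(float)
    intersection = B @ B.T                      # shared co-occurrence partners
    sizes = B.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - intersection
    return np.divide(intersection, union,
                     out=np.zeros_like(intersection), where=union > 0)


cos_sim = cosine_similarity_rows(C)   # shape (4, 4), values in [0, 1] for counts
jac_sim = jaccard_similarity_rows(C)  # shape (4, 4), values in [0, 1]
```

Both results are full item-item similarity matrices, so `cos_sim[i, j]` is the similarity between item `i` and item `j`. For a 1877 x 1877 matrix this is still cheap to compute densely.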

After calculating either similarity matrix (affinity matrix), you can use a spectral (or spatial) clustering algorithm, such as sklearn's SpectralClustering, to group the items.
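As a rough sketch of the clustering step: sklearn's `SpectralClustering` accepts a ready-made similarity matrix when you pass `affinity="precomputed"`. The 6 x 6 matrix `S` below is invented toy data with two obvious groups, and `n_clusters=2` is an assumption you would have to choose for your own data.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical similarity (affinity) matrix: items 0-2 and items 3-5
# are strongly similar within their group and weakly similar across groups.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.9, 0.0, 0.1, 0.0],
    [0.8, 0.9, 1.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 1.0, 0.9],
    [0.1, 0.0, 0.1, 0.8, 0.9, 1.0],
])

# affinity="precomputed" tells sklearn that S already holds pairwise
# similarities, so it must be symmetric and non-negative.
model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(S)  # one cluster label per item
```

The label values themselves are arbitrary; only which items share a label matters.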

Nico
1

You can treat it as 1877 items with 1877 features each. If two items are similar, then their co-occurrence profiles will be similar. Given that, you can use NearestNeighbors to find the closest items. There are many metrics available.

Also, preprocessing the data may help. I do not know its distribution, but you might want to normalize the values into the range [0, 1] or do something similar.
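Something along these lines should work as a starting point. The 5 x 5 matrix `X` is made-up data, and both the `MinMaxScaler` step (rescaling each column to [0, 1]) and the cosine metric are just one possible combination, not a recommendation for your particular distribution.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Hypothetical co-occurrence counts: each row is an item, each column a feature.
# Items 0 and 1 both co-occur mostly with items 2 and 3.
X = np.array([
    [0, 1, 8, 7, 0],
    [1, 0, 7, 8, 0],
    [8, 7, 0, 1, 2],
    [7, 8, 1, 0, 0],
    [0, 0, 2, 0, 0],
], dtype=float)

# Rescale every column into the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# Brute-force search with cosine distance (1 - cosine similarity).
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(X_scaled)
distances, indices = nn.kneighbors(X_scaled)

# Column 0 is the item itself (distance ~0); column 1 is its closest other item.
closest_other = indices[:, 1]
```

Here `closest_other[0]` gives the item most similar to item 0, and `distances[:, 1]` gives the corresponding cosine distances.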

mbednarski
  • If I understand you correctly, I use cosine to compute the similarity between columns in my matrix, with columns regarded as features? Regarding normalization, are you referring to rescaling each column to have a length of 1? from sklearn.preprocessing import normalize; normalized_X = normalize(X, axis=0, norm='l1') – kitchenprinzessin Feb 01 '17 at 09:44
1

If your co-occurrence matrix is symmetric, you don't need to normalize it. For more information about normalizing symmetric and asymmetric co-occurrence matrices, see: Leydesdorff, L. and Vaughan, L., 2006. Co-occurrence matrices and their applications in information science: Extending ACA to the Web environment. Journal of the American Society for Information Science and Technology, 57(12), pp.1616-1628.

Hamed Baziyad