
Introduction

I would like to assess the similarity between two "bin counts" arrays (obtained from two histograms) using MATLAB's pdist2 function:

% Input
bin_counts_a = [689   430   311   135    66    67    99    23    37    19     8     4     3     4     1     3     1     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     1];
bin_counts_b = [569   402   200   166   262    90    50    16    33    12     6    35    49     4    12     8     8     2     1     0     0     0     0     1     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     1];

% Visualize the two "bin counts" vectors as bars:
bar(1:length(bin_counts_a),[bin_counts_a;bin_counts_b])


% Calculation of similarities
cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')

% Output
cosine_similarity =

          0.95473215802008


jaccard_similarity =

        0.0769230769230769

Question

If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the jaccard similarity be closer to 1 as well?

limone
  • For comparing histograms, [earth mover’s distance](https://en.m.wikipedia.org/wiki/Earth_mover's_distance) would be more appropriate. – Cris Luengo Jun 26 '23 at 17:51
  • @CrisLuengo, thanks a lot!! I did not know about this distance as a measure of (dis)similarity of histograms. I will try it for sure! But just one question: in your opinion, why do you think this measure "would be more appropriate"? :-) – limone Jun 27 '23 at 10:57
  • 1
    More appropriate than the Jaccard measure for sure. There are several measures commonly used to compare histograms, this is one of them. If you add a bit of noise to the data, then the histogram could see values move from one bin to its neighbor. These changes add a small amount to the earth mover’s distance, but might add a larger amount to something like cosine dissimilarity. The Kullback-Leibler divergence, mentioned in Luis’ answer, is good too. Or you can use what they use in the Kolmogorov-Smirnoff test: the largest difference between the two cumulative histograms. – Cris Luengo Jun 27 '23 at 14:16
  • Great comment @CrisLuengo! Thanks a lot :-) – limone Jun 27 '23 at 14:22
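
A minimal MATLAB sketch of the two ideas from the comments above, assuming both histograms share the same bins (unit bin width) and are first normalized to sum to 1: for 1-D histograms the earth mover's distance then reduces to the sum of absolute differences between the cumulative histograms, and the Kolmogorov-Smirnov statistic is the largest such difference.

% Normalize the bin counts so that each histogram sums to 1
p = bin_counts_a / sum(bin_counts_a);
q = bin_counts_b / sum(bin_counts_b);

% Cumulative histograms (empirical CDFs over the bins)
P = cumsum(p);
Q = cumsum(q);

% 1-D earth mover's distance (unit bin width): sum of absolute CDF differences
emd_distance = sum(abs(P - Q));

% Kolmogorov-Smirnov-style statistic: largest absolute CDF difference
ks_statistic = max(abs(P - Q));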

1 Answer


The 'jaccard' measure, according to the documentation, only considers the "percentage of nonzero coordinates that differ", but not by how much they differ.

For instance, assume bin_counts_a as in your example and

bin_counts_b = bin_counts_a + 1;

Then

>> cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
cosine_similarity =
   0.999971577948095

is almost 1 as expected, because the bin counts are very similar. However,

>> jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
jaccard_similarity =
     0

gives 0 because each entry in bin_counts_b is (slightly) different from that in bin_counts_a.
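
To make the definition concrete, here is a minimal sketch of what pdist2's 'jaccard' option computes for two row vectors a and b: it counts the coordinates whose values differ, restricted to the coordinates where at least one of the two vectors is nonzero.

% Manual reproduction of the 'jaccard' distance computed by pdist2,
% for two row vectors a and b of equal length
nonzero = (a ~= 0) | (b ~= 0);      % coordinates where at least one entry is nonzero
differ  = (a ~= b) & nonzero;       % of those, the coordinates whose values differ
jaccard_distance   = sum(differ) / sum(nonzero);
jaccard_similarity = 1 - jaccard_distance;

With the original bin counts from the question this gives 2/26 ≈ 0.0769 (only two of the 26 jointly nonzero coordinates carry identical counts), and with bin_counts_b = bin_counts_a + 1 it gives 0, matching the outputs above.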

For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions, and is not computed by pdist2.
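
If you want to try the Kullback-Leibler divergence, a minimal sketch could look as follows, assuming the bin counts are first normalized to probabilities; the small offset e is an ad-hoc way to handle empty bins, and the two directed divergences are summed to obtain a symmetric value.

% Normalize the bin counts to probability distributions
p = bin_counts_a / sum(bin_counts_a);
q = bin_counts_b / sum(bin_counts_b);

% Ad-hoc small offset to avoid log(0) and division by zero in empty bins
e = 1e-10;
p = (p + e) / sum(p + e);
q = (q + e) / sum(q + e);

% KL divergence in both directions, and a symmetrized version
kl_pq  = sum(p .* log(p ./ q));
kl_qp  = sum(q .* log(q ./ p));
kl_sym = kl_pq + kl_qp;   % symmetrized (Jeffreys) divergence; 0 means identical distributions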

Luis Mendo
  • 1
    many many thanks!! Great answer :-) Yes, I was thinking about the Kullback-Leibler divergence, but I was not sure if appropriate for this case.... I tried also the Chi-Square Test, the Kolmogorov-Smirnov one-sample test and the Log-likelihood to compare the histograms... but I have noticed that the Chi-Square Test is quite "strict" (when using the usual "significance level" of 0.05%), which means that the two histograms are considered different (indeed I got the rejection of the null hypothesis)... therefore, I was looking for less "strict" measure of similarity.. – limone Jun 27 '23 at 11:04
  • "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions" by Sung-Hyuk Cha (http://www.fisica.edu.uy/~cris/teaching/Cha_pdf_distances_2007.pdf). This is a great review of (dis)similarity measures, but too many options and it is not easy to select the right ones for histograms :-) :-) Btw, I was thinking to focus on one type of measures, i.e. the "Table 4. Inner Product family". – limone Jun 27 '23 at 11:08