
I'm trying to calculate the cosine distance between two vectors, once via scipy and once manually. As a reference I tried to use the first and second answer from this thread:

Scipy code:

from scipy.spatial import distance as dist

distance = dist.cosine(v1, v2)

Self-made code:

import numpy as np

distance = np.inner(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

When I print both results, however, they are vastly different. The first one is the result of my code, the second one is scipy's:

1.1443029291450655e-05

0.369629880560304

I am aware that I am technically missing a 1 - before my own calculation to match the one in scipy, but even if I added it, the result would not even be close to matching.

Why are the results so different? Where did I go wrong in my calculation?

Edit:

Many suggest that the 1 - is what's missing; however, as seen below, the results are not even in the same range.

My self-made code gives results (depending on whether the 1 - is added or not) in the range x < 0.01 or x > 0.99. Scipy, however, is roughly in the range 0.1 < x < 0.9.

I have added two example inputs and the corresponding results at the bottom.

Self-made result

2.0613046661121477e-05

Scipy result

0.17675768302346695

EXAMPLE INPUT

v1

[ 97  99 104 109 105 101 100  98 103 115 122 127 136 137 143 146 151 157
 171 175 178 185 198 207 213 215 220  92  93  97  98  89  79  76  77  85
  95 102 110 118 126 127 131 141 151 162 164 180 184 191 204 212 214 215
  85  89  90  79  66  56  57  61  64  65  68  74  83  88  97 112 124 129
 140 151 160 167 177 187 193 193 200  80  80  73  62  51  46  46  50  50
  46  43  47  51  58  68  78  83  91 102 111 123 137 150 156 164 170 179
  73  66  55  44  38  33  34  37  40  38  34  33  31  34  37  38  45  51
  55  70  83  95  99 115 130 136 137  57  48  41  29  23  27  32  30  31
  29  30  26  20  17  18  17  19  23  28  32  36  47  48  59  82  97 108
  55  44  32  24  26  29  30  26  25  22  21  19  16  18  15  13  14  15
  16  18  20  22  27  29  40  54  77  44  33  24  25  26  26  30  29  27
  25  24  19  17  19  18  18  16  15  16  16  16  14  19  22  25  35  58
  36  31  24  24  30  36  42  50  52  50  51  51  51  46  39  35  30  23
  23  20  19  17  17  19  23  26  38  35  31  28  35  49  57  64  73  80
  86  91  89  92  95  85  70  57  48  39  34  30  28  24  24  27  30  30
  35  35  45  56  71  81  91 102 103 108 114 114 117 127 126 114 101  84
  68  62  56  52  45  43  46  45  41  44  51  62  73  84  92 101 109 113
 117 127 134 139 146 151 152 145 131 115 104  99  94  84  70  65  64  61
  49  59  65  72  82  90 103 110 117 123 132 143 155 166 173 179 176 167
 160 144 132 119 108  97  87  80  75  49  58  66  75  86  94 106 116 121
 125 132 142 155 177 187 188 190 183 182 176 162 149 132 120 111 105  99
  59  64  73  84  93 102 108 120 127 128 132 141 159 177 188 187 186 187
 189 183 176 168 154 140 131 122 113  66  71  79  87  98 103 108 121 129
 135 136 142 156 165 172 176 180 181 177 168 164 166 159 152 149 143 131
  68  77  86  92 101 106 109 117 123 131 130 137 143 142 147 153 155 154
 146 142 143 148 148 148 149 148 141  72  79  87  93  98 102 106 111 116
 124 125 129 131 134 141 147 145 140 132 125 124 128 131 127 128 133 134
  72  78  85  91  95  94 101 105 111 119 124 130 131 136 139 139 135 128
 120 114 114 115 115 113 113 114 120  74  76  85  91  91  91  95 100 108
 111 105  95  87  78  68  57  48  40  33  39  46  63  84  96 101 103 108
  74  76  83  85  83  83  81  79  68  51  35  28  33  36  18   7   6  43
  93  24   4   8  26  53  79  96 101  77  76  78  78  74  64  47  30  20
  23  38  69 108  91  24   7   3  79 167  38   7  11  34  61  50  74  98
  75  74  72  65  51  33  21  25  42  63  83 123 152 106  24   8   8  16
  22  13  12  16  64 138 102  53  88  75  69  62  50  42  38  35  49  79
  97 108 145 156 115  37   9   9   8  10  11  13  25  90 170 163  92  67
  80  69  58  56  60  55  48  62  79  95 114 137 149 137  78  19  11   9
   9  12  20  48 128 183 177 178 110  83  73  65  67  71  63  55  69  84
  92 102 119 127 129 114  59  28  29  49  34  47  94 137 141 130 141 100
  91  83  78  81  86  75  63  72  88  96  97  96  99 110 109 102  96 109
 131 108 116 121 111 108 113 109 101 105  96  93  92  96  90  75  77  88
  95 101 103 107 112 109 111 123 144 140 138 144 141 125 119 126 134 137
 115 104 100  98  99  96  87  87  92 103 114 120 129 128 127 127 134 148
 143 138 133 122 113 106 110 127 146]

v2

[101 103 103 101  95  88  81  81  81  80  74  73  79  87  90  92  98 100
  97  91  85  70  59  51  40  31  26 106 109 107 105  98  90  84  82  80
  78  73  73  80  87  87  90  98 101  97  90  83  73  62  50  40  32  26
 112 113 111 105  98  87  86  83  80  78  75  75  81  87  89  94  99  99
  96  89  81  72  59  49  38  31  29 121 118 113 107  97  89  88  83  80
  77  73  74  78  82  84  90  96  93  91  90  84  71  57  45  38  29  26
 125 122 115 108  98  88  86  86  82  77  73  72  75  76  78  85  93  93
  87  83  78  68  59  44  37  28  24 125 123 116 106  99  92  90  85  79
  73  71  71  70  72  77  83  86  87  81  74  67  62  55  43  36  29  23
 129 120 110 102  99  94  92  85  76  69  67  65  66  68  75  77  76  75
  75  71  63  59  54  41  35  28  22 128 115 108 103 102 101  93  80  73
  67  66  63  62  64  68  70  68  67  68  65  58  51  48  40  34  29  24
 127 118 114 110 114 107  87  71  64  61  60  62  62  62  64  68  68  68
  66  64  59  50  45  40  34  30  25 129 129 125 123 126 104  75  56  53
  54  56  61  62  61  62  67  68  70  66  64  62  56  47  40  35  30  26
 135 135 134 141 131  89  55  45  50  53  58  64  62  62  68  72  73  74
  76  73  66  62  54  44  37  32  27 144 145 151 150 122  68  45  46  52
  55  61  65  65  67  73  79  84  89  86  81  73  70  60  50  44  34  28
 148 157 162 149 104  52  47  53  57  62  66  69  71  75  83  90  95  94
  91  86  83  80  71  61  56  45  33 157 164 163 140  83  51  57  61  65
  72  76  78  83  87  97 102 104 104 102  99  93  92  85  73  66  57  44
 157 165 161 127  67  60  66  71  79  84  85  90  96  99 110 122 127 125
 120 120 109 102  93  82  76  69  57 159 170 154 110  67  69  75  87  97
  98  94  99 107 112 121 140 160 158 144 135 124 115 104  93  83  74  61
 158 155 146  97  76  83  91 104 114 113 104 103 106 116 126 141 174 183
 166 148 128 115 104  96  85  75  63 157 147 135  96  83  97 108 117 129
 126 117 111 105 111 121 140 163 168 156 144 128 116 106  95  81  72  67
 158 146 129 103  93 109 117 124 130 125 113 106 106 108 117 136 145 148
 140 132 126 120 113  97  76  72  77 154 139 124 108 100 112 117 108  97
  93  73  61  80 103 114 130 139 138 130 123 122 119 111  93  74  71  89
 141 133 123 109  97 103 107  95 100 102  83  79  87  99 108 120 127 126
 121 115 116 111 101  84  65  63  84 133 129 116 104  92  90 101 106 115
 111 100 102 109 117 113 113 122 122 117 110 107 105  94  75  60  63  78
 126 119 106 100  92  80  78  87  94  94  94  96 110 129 135 123 117 112
 108 103 103 103  95  77  59  54  66 119 106  98  95  90  83  77  74  73
  70  75  89 110 130 141 140 129 116 110 109 113 115 105  80  52  36  41
 105  99  96  92  89  87  83  81  77  72  76  91 110 125 140 150 138 130
 126 126 131 126 109  81  47  36  47  99  97  95  95  91  88  84  84  82
  79  84  97 114 126 138 148 141 135 139 141 141 132 110  83  54  45  54
  99  97  96  95  90  84  83  87  86  85  89  98 114 122 133 140 136 131
 139 144 139 124 106  83  63  53  57  99  98  97  95  89  86  86  86  87
  89  92  98 108 114 127 128 122 121 123 128 129 121 101  82  67  57  58
 100  97  93  92  91  92  88  85  90  92  95 102 109 112 120 117 114 118
 116 120 124 115  98  84  72  63  59]
  • Adding the `1 -` seems to work for me. What inputs did you try it with? – rchome Nov 28 '21 at 17:44
  • Like @rchome, I can't reproduce your result. It will be easier for someone to help you if you provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Warren Weckesser Nov 28 '21 at 18:11
  • Your self-made code is computing the cosine **similarity**. The scipy code is computing the cosine **distance**. – Stef Nov 29 '21 at 11:05
  • I've added examples to my post. I'm not sure if similarity/distance is the problem here, since both results aren't even in the same range. – Typo Nov 29 '21 at 15:59

1 Answer


scipy defines the cosine distance as 1 minus the dot product of the two vectors divided by the product of their Euclidean norms (the square roots of the sums of the squared elements). It might be easier to look at the formula on Wikipedia than to read my description of it.
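
In symbols, writing u · v for the dot product and ‖u‖ for the Euclidean norm:

cosine_distance(u, v) = 1 - (u · v) / (‖u‖ ‖v‖)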

Your code seems to be missing the "1 -" part.

import random
import numpy as np
from scipy.spatial import distance as dist

a = [random.randint(1, 100) for _ in range(100)]
b = [random.randint(1, 100) for _ in range(100)]

1 - np.inner(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# 0.2412...

dist.cosine(a, b)
# 0.241286...
inteoryx
  • scipy defines the cosine **distance** as 1 - dot product of two vectors over the product of the square roots of the sum of all the elements squared. – Stef Nov 29 '21 at 11:02
  • This whole "1 -" confusion specifically stems from a confusion between distance and similarity. – Stef Nov 29 '21 at 11:03
  • But even if I add the "1 -": 1 - 1.44e-05 will be 0.99-something and NOT in the range of 0.3-0.7, which is what scipy results in whether the 1 - is added or not. – Typo Nov 29 '21 at 15:44
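
A plausible cause of that range mismatch, going by the pixel-like example inputs (this is an assumption; the thread never states the dtype): if v1 and v2 are NumPy arrays of dtype uint8, np.inner keeps the integer dtype and wraps around modulo 256, while np.linalg.norm and scipy's dist.cosine both compute in float. That would reproduce the symptom exactly: a manual result of order 1e-05 next to a scipy result of order 0.1. A minimal sketch with made-up uint8 data:

import numpy as np
from scipy.spatial import distance as dist

# Hypothetical inputs: pixel-like values stored as uint8,
# as they would be after loading an 8-bit image.
rng = np.random.default_rng(0)
v1 = rng.integers(0, 256, size=729, dtype=np.uint8)
v2 = rng.integers(0, 256, size=729, dtype=np.uint8)

# np.inner keeps the integer dtype, so the dot product wraps around
# modulo 256; the norms are computed in float and stay correct.
manual = np.inner(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(manual)            # tiny, order 1e-05 or less: the symptom in the question

# scipy converts to float64 internally, so it is unaffected.
print(dist.cosine(v1, v2))

# Casting to float first makes the manual version match scipy.
v1f = v1.astype(np.float64)
v2f = v2.astype(np.float64)
print(1 - np.inner(v1f, v2f) / (np.linalg.norm(v1f) * np.linalg.norm(v2f)))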