
I need to calculate cosine similarity on huge files that contain rows of numbers, for example:

6 3 574

11 1 6 575 576 321

4 577 6 64

69 11 6 55

11 218 6 578 579 580 581 229 582 583 155 100 584 148 446 585

I already store it in a jagged array of strings: each line is split so that every number is in its own cell.

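// Split the file into lines, then each line into whitespace-separated tokens.
string[] lines = FileBuff.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
FileMatrix = new string[lines.Length][];
for (int i = 0; i < lines.Length; i++)
{
    // RemoveEmptyEntries avoids empty cells caused by repeated or trailing separators.
    FileMatrix[i] = lines[i].Split(new string[] { "\t", " " }, StringSplitOptions.RemoveEmptyEntries);
}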

My question is: how do I calculate the cosine similarity of rows that have different sizes?
To compute the numerator, both rows must be the same size (A[i]*B[i] + A[i+1]*B[i+1] + ...).

I found this example; it's the same problem as mine, just with words instead of numbers:

Document 1: The quick brown fox jumped over the lazy dog.

Global order:     The quick brown fox jumped over the lazy dog
Vector for Doc 1:  1    1     1    1     1     1    1   1   1

Document 2: The runner was quick.

Global order:     The quick brown fox jumped over the lazy dog runner was
Vector for Doc 1:  1    1     1    1     1     1    1   1   1
Vector for Doc 2:  1    1     0    0     0     0    0   0   0    1     1

In this case, in theory, I need to pad the Document 1 vector with zeros at the end. I need help with some code that does this.
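For illustration, the "global order" idea from the example could be sketched in C# roughly as follows. This assumes each distinct token gets its own dimension and each row becomes a count vector over all tokens seen in any row; `GlobalOrderSketch` and `ToVectors` are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original post.
public static class GlobalOrderSketch
{
    // Turns rows of tokens into equal-length count vectors.
    // Slot order is the order in which tokens first appear (the "global order").
    public static double[][] ToVectors(string[][] rows)
    {
        // First pass: assign every distinct token a fixed position.
        var index = new Dictionary<string, int>();
        foreach (var row in rows)
        {
            foreach (var token in row)
            {
                if (token.Length > 0 && !index.ContainsKey(token))
                {
                    index.Add(token, index.Count);
                }
            }
        }

        // Second pass: count tokens per row. New arrays are zero-filled,
        // so tokens missing from a row are already "padded" with 0.
        var vectors = new double[rows.Length][];
        for (int i = 0; i < rows.Length; i++)
        {
            vectors[i] = new double[index.Count];
            foreach (var token in rows[i])
            {
                if (token.Length > 0)
                {
                    vectors[i][index[token]] += 1.0;
                }
            }
        }
        return vectors;
    }
}
```

Calling `GlobalOrderSketch.ToVectors(FileMatrix)` would give one vector per row, all of the same length, so the usual dot-product loop applies. With many distinct numbers these dense vectors get large, so this is probably only practical when the number of distinct tokens is small.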

Vertexwahn
  • So each row is a vector? E.g. the first row of your example is a 3D vector and the second row is a 6D vector? – Frank J Mar 23 '16 at 20:46
  • Then it's probably not right to call it a vector; it can be regarded as rows of numbers, and I need to calculate the cosine similarity for each pair of rows – Itzik BenZaquen Mar 23 '16 at 21:11
  • Well, just calling it something else doesn't make the problem go away. AFAIK you need the same number of dimensions to calculate the cosine similarity. You can, however, find out how long the longest vector/row is and pad all the shorter ones with default values for the missing dimensions (e.g. zeros). If you don't know what the data represents, you can't really determine similarity... – Frank J Mar 23 '16 at 21:31

2 Answers


Vectors must be of the same length. If they are not, you have to pad the one with the smaller dimensionality with zeros. Basically, the logic is as follows:

Consider 2 vectors: (0,1) and (0,0,1).

The first one is 2D, the second one is 3D. You can treat the 2D vector as a 3D vector that lies in the (x,y) plane. So (0,1) is equivalent to (0,1,0).
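The padding logic above can be sketched in C# as follows. This is a minimal sketch that assumes the rows have already been parsed into `double[]` and that position i in one row means the same thing as position i in another; `PaddedCosineSketch`, `PadWithZeros` and `Cosine` are hypothetical names.

```csharp
using System;

// Hypothetical helper, not from the original post.
public static class PaddedCosineSketch
{
    // Returns a copy of v extended with trailing zeros up to the given length.
    // Assumes length >= v.Length.
    public static double[] PadWithZeros(double[] v, int length)
    {
        var padded = new double[length]; // new arrays are zero-filled
        Array.Copy(v, padded, v.Length);
        return padded;
    }

    // Cosine similarity after padding the shorter vector with zeros.
    public static double Cosine(double[] a, double[] b)
    {
        int n = Math.Max(a.Length, b.Length);
        a = PadWithZeros(a, n);
        b = PadWithZeros(b, n);

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < n; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // Cosine is undefined for a zero vector; returning 0 is a convention chosen for this sketch.
        if (normA == 0 || normB == 0)
        {
            return 0;
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

With the two vectors above, `Cosine(new double[] { 0, 1 }, new double[] { 0, 0, 1 })` treats the first as (0,1,0), so the dot product is 0 and the vectors are orthogonal. Since padded positions contribute nothing to the dot product or the norms, the explicit copy could be skipped by looping only over the shorter length for the numerator.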

Also see an answer to this question in the Python section.

Community

It depends.

If your data is supposed to be a continuous vector space, then vectors have to be the same length.

If your data is a sparse vector, then by definition missing values are 0 (usually).

Your data looks as if you only have the indexes of 1s.

Then cosine boils down to counting the intersection size, divided by the geometric mean of the two row sizes, i.e. |A ∩ B| / sqrt(|A| * |B|); I'd go with Jaccard on such data instead.

You need to know the input format. There are multiple answers unless you give the essential information: how the data is encoded and what it means.
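If the rows really are lists of the indexes of 1s (an assumption about the data, as noted above), both measures can be computed on sets without any padding. A minimal C# sketch, with `SetSimilaritySketch` and its methods as hypothetical names:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original post.
// Each row is treated as the set of indexes whose value is 1.
public static class SetSimilaritySketch
{
    // Number of indexes present in both rows.
    public static int IntersectionSize(HashSet<int> a, HashSet<int> b)
    {
        int count = 0;
        foreach (int x in a)
        {
            if (b.Contains(x))
            {
                count++;
            }
        }
        return count;
    }

    // Cosine for binary vectors: |A ∩ B| / sqrt(|A| * |B|).
    public static double Cosine(HashSet<int> a, HashSet<int> b)
    {
        // Returning 0 for an empty row is a convention chosen for this sketch.
        if (a.Count == 0 || b.Count == 0)
        {
            return 0;
        }
        return IntersectionSize(a, b) / Math.Sqrt((double)a.Count * b.Count);
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|.
    public static double Jaccard(HashSet<int> a, HashSet<int> b)
    {
        int inter = IntersectionSize(a, b);
        int union = a.Count + b.Count - inter;
        return union == 0 ? 0 : (double)inter / union;
    }
}
```

For the rows `6 3 574` and `4 577 6 64` from the question, the only shared index is 6, so cosine would be 1 / sqrt(3 * 4) and Jaccard 1 / 6. A row could be turned into a set with something like `new HashSet<int>(Array.ConvertAll(row, int.Parse))`, assuming every cell is a valid integer. Note that a set drops repeated numbers within a row, which is only right under the binary interpretation.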

Has QUIT--Anony-Mousse
  • In my project I get big datasets of numbers (a lot of lines; the 5 lines I showed are just an example), and I need to compare Jaccard distance and cosine similarity to get distances, and then use the k-means algorithm. So my question is: how do I calculate the numerator in the formula? – Itzik BenZaquen Mar 23 '16 at 21:28
  • What is 574? It's *you* who needs to answer this. If these are random numbers, stop using random numbers. (P.S. Do *not* use k-means with other distances.) – Has QUIT--Anony-Mousse Mar 23 '16 at 22:12