Basically, I want to reimplement this video.
Given a corpus of documents, I want to find the terms that are most similar to each other.
I was able to generate a cooccurrence matrix using this SO thread and, following the video, an association matrix. Next, I would like to generate a second-order cooccurrence matrix.
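For reference, this is roughly how I build the first-order cooccurrence matrix (a minimal sketch, not the exact code from the SO thread; docs is a placeholder corpus):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; in my case this is the real document collection.
docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)    # term-document counts, shape (n_docs, n_terms)
cooc = (X.T @ X).toarray()            # term-term cooccurrence, shape (n_terms, n_terms)
np.fill_diagonal(cooc, 0)             # drop self-cooccurrence counts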
Problem statement: Consider a matrix where the rows correspond to terms and the entries in each row are the top k terms similar to that term. Say k = 4 and we have n terms in our dictionary; then the matrix M has n rows and 4 columns.
HAVE:
M = [[18,34,54,65], # Term IDs similar to Term t_0
     [18,12,54,65], # Term IDs similar to Term t_1
     ...
     [21,43,55,78]] # Term IDs similar to Term t_n
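(For completeness: assuming the association matrix A is an n x n similarity matrix, with larger values meaning more similar, I get M with something like the following sketch.)

import numpy as np

k = 4
A = np.random.rand(5, 5)             # placeholder association matrix
np.fill_diagonal(A, -np.inf)         # exclude the term itself from its neighbours
M = np.argsort(-A, axis=1)[:, :k]    # top-k most similar term IDs per term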
So, M contains, for each term ID, the most similar term IDs. Now I would like to check how many of those similar terms match. In the example of M above, it seems that terms t_0 and t_1 are quite similar, because three out of four terms match, whereas terms t_0 and t_n are not similar, because no terms match. Let's write M as a series of lists.
M = [list_0, # Term IDs similar to Term t_0
     list_1, # Term IDs similar to Term t_1
     ...
     list_n] # Term IDs similar to Term t_n
WANT:
C = [[f(list_0, list_0), f(list_0, list_1), ..., f(list_0, list_n)],
     [f(list_1, list_0), f(list_1, list_1), ..., f(list_1, list_n)],
     ...
     [f(list_n, list_0), f(list_n, list_1), ..., f(list_n, list_n)]]
I'd like to find the matrix C that has as its elements a function f applied to the lists of M. Here f(a, b) measures the degree of similarity between two lists a and b. Going with the example above, the degree of similarity between t_0 and t_1 should be high, whereas the degree of similarity between t_0 and t_n should be low.
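To make the intent concrete, here is a naive placeholder for f (a simple overlap count, treating the lists as unordered sets) applied pairwise to get C; I suspect there are better choices, which is exactly what I am asking about below.

import numpy as np

def f(a, b):
    # Placeholder similarity: number of shared term IDs, ignoring order.
    return len(set(a) & set(b))

M = [[18, 34, 54, 65],   # t_0
     [18, 12, 54, 65],   # t_1
     [21, 43, 55, 78]]   # t_n

n = len(M)
C = np.array([[f(M[i], M[j]) for j in range(n)] for i in range(n)])
# C[0, 1] == 3 -> t_0 and t_1 share three of their four similar terms
# C[0, 2] == 0 -> t_0 and t_n share none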
My questions:
- What is a good choice for comparing the ordering of two lists? That is, what is a good choice for function f?
- Is there a transformation already available that takes as input a matrix like M and produces a matrix like C? Preferably a Python package?
Thank you, r0f1