I have two sentences:
Skies are blue. Grass is green.
I would like to compute a simple word co-occurrence matrix (or word vector space embedding matrix; I am not sure what the proper terminology is). Here is what I want. There are 6 distinct words in the two sentences above, so my matrix will be 6 by 6. Assume the words have the following ordering, which is used for both rows and columns: 0 - Skies, 1 - are, 2 - blue, 3 - Grass, 4 - is, 5 - green. I would then like to count co-occurrences using a window of size 2 (meaning the 2 words before the current word and the 2 words after it).
- Element with index [0,0] will have value 0, since `Skies` does not co-occur with `Skies`.
- Element with index [0,1] will have value 1, since `are` occurs next to `Skies` only once.
- Element with index [0,2] will have value 1, since `blue` occurs within two words of `Skies` only once.
And so on. Is there a scikit module for this? I looked at the following question, but it does not seem to answer my question.
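To make the counting rule concrete, here is a minimal pure-Python sketch of what I have in mind. It rests on two assumptions of mine: the sentences are already split into words, and the window does not cross sentence boundaries. The function name `cooccurrence_matrix` is only illustrative and does not come from any library.

```python
def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words appears within `window` positions.

    `sentences` is a list of already-tokenized sentences (lists of words).
    Assumption: windows never extend across a sentence boundary.
    """
    # Index words in order of first appearance, so that
    # 0 - Skies, 1 - are, 2 - blue, 3 - Grass, 4 - is, 5 - green.
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))

    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]

    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Positions up to `window` words before and after position i.
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word does not co-occur with its own position
                    matrix[vocab[word]][vocab[sentence[j]]] += 1

    return vocab, matrix


sentences = [["Skies", "are", "blue"], ["Grass", "is", "green"]]
vocab, matrix = cooccurrence_matrix(sentences, window=2)
for row in matrix:
    print(row)
```

Under these assumptions the first row comes out as `[0, 1, 1, 0, 0, 0]`, which matches the elements [0,0], [0,1] and [0,2] described above. If the window were allowed to cross the sentence boundary, `blue` would also co-occur with `Grass` and `is`, so the result depends on that choice.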
Update: This matrix is the key object of the distributional hypothesis.