9

I'm working on an NLP task and I need to calculate the co-occurrence matrix over documents. The basic formulation is as below:

Here I have a matrix with shape (n, length), where each row represents a sentence composed of length words, so there are n sentences in all, each of the same length. Then, with a defined context size, e.g., window_size = 5, I want to calculate the co-occurrence matrix D, where the entry in the cth row and wth column is #(w,c), the number of times that a context word c appears in w's context.
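
To make the definition concrete, here is a minimal sketch that counts #(w,c) for a single row of the matrix. It assumes the window extends window_size words on each side of the central word and that a word is not counted as its own context (the question does not state either point); the function name count_pairs and the toy sentence are only illustrative.

```python
from collections import Counter

def count_pairs(sentence, window_size):
    """Count #(w, c): how often c appears within window_size positions of w."""
    counts = Counter()
    for i, w in enumerate(sentence):
        lo = max(i - window_size, 0)
        hi = min(i + window_size + 1, len(sentence))
        for j in range(lo, hi):
            if j != i:  # assumption: the central word is not its own context
                counts[(w, sentence[j])] += 1
    return counts

# Hypothetical toy sentence of word indices (one row of the matrix)
toy_sentence = [0, 1, 2, 1]
pair_counts = count_pairs(toy_sentence, window_size=1)
```

Each key of pair_counts is a (w, c) pair, so pair_counts[(w, c)] plays the role of one entry of D for this single sentence.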

An example can be found here: How to calculate the co-occurrence between two words in a window of text?

I know it can be calculated by stacking loops, but I want to know if there exists a simple way or a simple function. I have found some answers, but they do not work with a window sliding through the sentence. For example: word-word co-occurrence matrix

So could anyone tell me whether there is any function in Python that can deal with this problem concisely? I think this task is quite common in NLP.

Community
GEORGE GUO

2 Answers

10

It is not that complicated, I think. Why not write a function yourself? First get the matrix X according to this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage Then, for each sentence, calculate the co-occurrences and add them to a running total.

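import numpy as np

# window: context size on each side; length: number of distinct words
# (vocabulary size), so every word index in X must lie in [0, length - 1]
m = np.zeros([length, length])
def cal_occ(sentence, m):
    for i, word in enumerate(sentence):
        # context = up to `window` words on each side of position i
        for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
            if j != i:  # skip the central word itself
                m[word, sentence[j]] += 1
for sentence in X:
    cal_occ(sentence, m)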
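
If X is the (n, length) integer array described in the question, the same counts can also be accumulated without looping over individual positions. The following is a sketch under assumptions, not a library routine: the name cooccurrence_shifted and the vocab_size parameter are hypothetical, every entry of X must be an index in [0, vocab_size - 1], and the window is taken to be symmetric and to exclude the central word. It pairs each column slice of X with a copy shifted by k positions and uses NumPy's np.add.at so that repeated index pairs are all counted.

```python
import numpy as np

def cooccurrence_shifted(X, vocab_size, window):
    """Accumulate symmetric window co-occurrence counts from an (n, length) index array."""
    X = np.asarray(X)
    m = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    for k in range(1, window + 1):
        left = X[:, :-k].ravel()   # word at position i
        right = X[:, k:].ravel()   # word at position i + k
        # Unbuffered add: repeated (row, col) pairs each contribute 1
        np.add.at(m, (left, right), 1)
        np.add.at(m, (right, left), 1)
    return m

# Hypothetical toy input: 2 sentences of length 4 over a 3-word vocabulary
X = np.array([[0, 1, 2, 1],
              [2, 2, 0, 1]])
D = cooccurrence_shifted(X, vocab_size=3, window=2)
```

The Python loop here runs only window times, once per offset, rather than once per word.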
Ruzihm
Zealseeker
  • Thanks a lot. I have seen the `CountVectorizer` function, but the occurrence it calculates is not what I want here. In the occurrence matrix I mean here, each item represents the occurrence of a **central** word `w` and its context word `c`. The function in `CountVectorizer` just calculates every pair of words' occurrence in a sentence. I can only accomplish it by using a stack of loops, and I'm wondering whether a simple API exists. Thanks for your reply :) – GEORGE GUO Jan 16 '17 at 04:35
  • @GEORGEGUO I'm sorry for the mistaken use of CountVectorizer. But you can directly use the transcribed sentences with the code I gave. In short, it's not complicated and you can make such an "API" yourself. – Zealseeker Jan 16 '17 at 05:32
  • ya I know. Thanks – GEORGE GUO Jan 16 '17 at 06:11
  • @Holdwin Thanks for your editing. It's really helpful. I'm embarrassed that there were so many mistakes in my answer. – Zealseeker Nov 01 '18 at 01:28
  • @Ruzihm What is `m[word,sentence[j]]+=1`? I am getting a slicing index error – Akshay Indalkar Jul 19 '19 at 17:25
  • @AkshayIndalkar It increments the co-occurrence of the word with index `word` and the word with the index at `sentence[j]` by one. Make sure all of the values in `sentence` are between 0 and `length-1`. If you still need help, [ask a question](https://stackoverflow.com/questions/ask) including your code and [a link back to this answer](https://stackoverflow.com/a/41663359/1092820). – Ruzihm Jul 19 '19 at 17:44
  • `length = len(sentence) for j in range(max(i - window_size, 0), min(i + window_size+1, length)): if i == j: continue m[word, sentence[j]] += 1` – Andrew Matiuk Oct 04 '20 at 17:20
  • It is a partial code where, to each sentence, *m* is a matrix of sentence length x length, *word* is in the sentence, *sentence* is a list of tokenized sentence, and *sentence[j]* another word in the sentence. – Mello May 03 '22 at 21:41
0

I have calculated the co-occurrence matrix with window size = 2.

  1. First, write a function that returns the correct neighbourhood words (here I have used GetContext).

  2. Create the matrix and add 1 whenever a particular word is present in the neighbourhood.

Here is the Python code:

import numpy as np

CORPUS = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]

# Words to count; could also be built from a corpus,
# e.g. list(set((' '.join(ctxs)).split(' ')))
top2000 = ["abc", "pqr", "def"]
a = np.zeros((len(top2000), len(top2000)), np.int32)
# GetContext (shown below) must be defined before this loop runs
for sentence in CORPUS:
    for index, word in enumerate(sentence.split(' ')):
        if word in top2000:
            print(word)
            context = GetContext(sentence, index)
            print(context)
            for word2 in context:
                if word2 in top2000:
                    a[top2000.index(word)][top2000.index(word2)] += 1
print(a)

The GetContext function:

def GetContext(sentence, index):
    # Return the words up to two positions to the left and right of `index`
    words = sentence.split(' ')
    left = words[max(index - 2, 0):index]   # clamp at the sentence start
    right = words[index + 1:index + 3]      # slicing stops at the sentence end
    return left + right

Here is the result:

array([[0, 3, 3],
       [3, 0, 2],
       [3, 2, 0]])
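
The same idea extends to an arbitrary window size. The sketch below is one possible generalisation, not the answerer's code: the names context_words and cooccurrence are hypothetical, and a dictionary lookup replaces the repeated top2000.index calls. With window=2 it should reproduce the matrix shown above for the same corpus.

```python
import numpy as np

def context_words(words, index, window=2):
    """Return up to `window` words on each side of position `index`."""
    left = words[max(index - window, 0):index]
    right = words[index + 1:index + 1 + window]
    return left + right

def cooccurrence(corpus, vocab, window=2):
    """Count window co-occurrences between the words listed in `vocab`."""
    position = {word: k for k, word in enumerate(vocab)}  # word -> row/column
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
    for sentence in corpus:
        words = sentence.split()
        for index, word in enumerate(words):
            if word not in position:
                continue  # only words in `vocab` act as central words
            for other in context_words(words, index, window):
                if other in position:
                    counts[position[word], position[other]] += 1
    return counts

corpus = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
result = cooccurrence(corpus, ["abc", "pqr", "def"], window=2)
```

Because slicing never raises on out-of-range bounds, this version also handles sentences shorter than the window.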