Say, I've a dataset in CSV format, which contains sentences/paragraphs in rows. Suppose, it looks like this:
df = ['A B X B', 'X B B']
Now, I can generate co-occurrence matrix that looks like this
A B X
A 0 2 1
B 2 0 4
X 1 4 0
Here, (A,B,X) are words. It says B appeared where X is present = 4 times Code that I used for it
def co_occurrence(sentences, window_size):
d = defaultdict(int)
vocab = set()
for text in sentences:
# preprocessing (use tokenizer instead)
text = text.lower().split()
# iterate over sentences
for i in range(len(text)):
token = text[i]
vocab.add(token) # add to vocab
next_token = text[i+1 : i+1+window_size]
for t in next_token:
key = tuple( sorted([t, token]) )
d[key] += 1
# formulate the dictionary into dataframe
vocab = sorted(vocab) # sort vocab
df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
index=vocab,
columns=vocab)
for key, value in d.items():
df.at[key[0], key[1]] = value
df.at[key[1], key[0]] = value
return df
The beauty of this code segment is that it allows me to choose windows size. That means if a particular word doesn't appear with a fixed amount of range from total sentence size then it gets ignored. But I would like to scale it.
So this means if a word is far from the target word "to" then it will be assigned lesser values. Unfortunately, I couldn't find a suitable solution for it. Is it possible with a package such as scikit-learn? Or is there any other way to do it except raw coding?