Scaled Co-occurrence matrix with window size calculation in python

Question

Say, I've a dataset in CSV format, which contains sentences/paragraphs in rows. Suppose, it looks like this:

df = ['A B X B', 'X B B']

Now, I can generate co-occurrence matrix that looks like this

Here, (A,B,X) are words. It says B appeared where X is present = 4 times Code that I used for it

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over sentences
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

The beauty of this code segment is that it allows me to choose windows size. That means if a particular word doesn't appear with a fixed amount of range from total sentence size then it gets ignored. But I would like to scale it.

So this means if a word is far from the target word "to" then it will be assigned lesser values. Unfortunately, I couldn't find a suitable solution for it. Is it possible with a package such as scikit-learn? Or is there any other way to do it except raw coding?

You're ever so close to asking for a tool, library, etc. ;-) — thebjorn, Jul 17 '20 at 08:29
Does the co-occurence matrix come from the df declared above it? — thebjorn, Jul 17 '20 at 08:30
Did you already see: https://stackoverflow.com/a/49667439/75103 ? — thebjorn, Jul 17 '20 at 10:12
@thebjorn it produces the same co-occurence matrix that I've already managed to create. I'm trying to create a scaled one. — AtanuCSE, Jul 17 '20 at 11:24

Zachary Yocum · Accepted Answer · 2020-07-17T16:26:14.357

Here’s an implementation that can optionally scale the accumulated co-occurence values based on the distance between word tokens in the input sentences:

In [11]: sentences = ['from swerve of shore to bend of bay , brings'.split()]                                    

In [12]: index, matrix = co_occurence_matrix(sentences, window=3, scale=True)                                    

In [13]: cell = index['bend'], index['of']                                                                       

In [14]: matrix[cell]                                                                                            
Out[14]: 1.3333333333333333

In [15]: index, matrix = co_occurence_matrix(sentences, window=3, scale=False)                                   

In [16]: matrix[cell]                                                                                            
Out[16]: 2.0

In [17]: {w: matrix[index['to']][i] for w, i in index.items()}                                                   
Out[17]: 
{',': 0.0,
 'bend': 1.0,
 'of': 1.0,
 'bay': 0.3333333333333333,
 'brings': 0.0,
 'to': 0.0,
 'from': 0.0,
 'shore': 1.0,
 'swerve': 0.3333333333333333}

@AtanuCSE, it’s possible the `distances` method could be optimized a bit more—there’s probably a cleaner iterative solution to check the tokens rather than going through all pairwise combinations—but glad this is what you were looking for. You could also play around with other distance metrics, like `math.sqrt(distance)`. — Zachary Yocum, Jul 17 '20 at 21:19

Scaled Co-occurrence matrix with window size calculation in python

1 Answers1