1

Say, I've a dataset in CSV format, which contains sentences/paragraphs in rows. Suppose, it looks like this:

df = ['A B X B', 'X B B']

Now, I can generate co-occurrence matrix that looks like this

  A B X
A 0 2 1
B 2 0 4
X 1 4 0

Here, (A,B,X) are words. It says B appeared where X is present = 4 times Code that I used for it

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over sentences
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

The beauty of this code segment is that it allows me to choose windows size. That means if a particular word doesn't appear with a fixed amount of range from total sentence size then it gets ignored. But I would like to scale it.

Scaled co-occurence

So this means if a word is far from the target word "to" then it will be assigned lesser values. Unfortunately, I couldn't find a suitable solution for it. Is it possible with a package such as scikit-learn? Or is there any other way to do it except raw coding?

AtanuCSE
  • 8,832
  • 14
  • 74
  • 112

1 Answers1

1

Here’s an implementation that can optionally scale the accumulated co-occurence values based on the distance between word tokens in the input sentences:

In [11]: sentences = ['from swerve of shore to bend of bay , brings'.split()]                                    

In [12]: index, matrix = co_occurence_matrix(sentences, window=3, scale=True)                                    

In [13]: cell = index['bend'], index['of']                                                                       

In [14]: matrix[cell]                                                                                            
Out[14]: 1.3333333333333333

In [15]: index, matrix = co_occurence_matrix(sentences, window=3, scale=False)                                   

In [16]: matrix[cell]                                                                                            
Out[16]: 2.0

In [17]: {w: matrix[index['to']][i] for w, i in index.items()}                                                   
Out[17]: 
{',': 0.0,
 'bend': 1.0,
 'of': 1.0,
 'bay': 0.3333333333333333,
 'brings': 0.0,
 'to': 0.0,
 'from': 0.0,
 'shore': 1.0,
 'swerve': 0.3333333333333333}

  • Working like a charm. Thank you. – AtanuCSE Jul 17 '20 at 21:05
  • 1
    @AtanuCSE, it’s possible the `distances` method could be optimized a bit more—there’s probably a cleaner iterative solution to check the tokens rather than going through all pairwise combinations—but glad this is what you were looking for. You could also play around with other distance metrics, like `math.sqrt(distance)`. – Zachary Yocum Jul 17 '20 at 21:19