21

The PyPI docs for a Google Ngram downloader say that "sometimes you need an aggregate data over the dataset. For example to build a co-occurrence matrix."

The Wikipedia article on co-occurrence matrices is about image processing, and googling the term seems to bring up some sort of SEO trick.

So what are co-occurrence matrices (in computational linguistics/NLP)? How are they used in NLP?

Evgenia Karunus
bernie2436

2 Answers

28

What is a co-occurrence matrix?

Generally speaking, a co-occurrence matrix has specific entities in rows (ER) and columns (EC). The purpose of this matrix is to present the number of times each ER appears in the same context as each EC. As a consequence, in order to use a co-occurrence matrix, you have to define your entities and the context in which they co-occur.

In NLP, the most classic approach is to define each entity (i.e., each row and column) as a word present in a text, and the context as a sentence.

Consider the following text:

Roses are red. Sky is blue.

With the classic approach described before, we'll have the following matrix:

      |  Roses | are | red | Sky | is | blue
Roses |    1   |  1  |  1  |  0  |  0 |   0
are   |    1   |  1  |  1  |  0  |  0 |   0
red   |    1   |  1  |  1  |  0  |  0 |   0
Sky   |    0   |  0  |  0  |  1  |  1 |   1
is    |    0   |  0  |  0  |  1  |  1 |   1
blue  |    0   |  0  |  0  |  1  |  1 |   1

Here, each cell indicates whether the two items co-occur or not. You may replace this indicator with the number of times the pair co-occurs, or with a more sophisticated measure. You may also change the entities themselves, for example by putting nouns in columns and adjectives in rows instead of every word.
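As a rough illustration of the sentence-level approach above, here is a minimal Python sketch that rebuilds the matrix for the example text. The function name `sentence_cooccurrence` and the naive regex-based sentence and word splitting are assumptions made for this example; real text would need a proper tokenizer.

```python
import re
from itertools import product


def sentence_cooccurrence(text):
    """Binary term-term matrix: 1 if two words share at least one sentence."""
    # Naive sentence split on '.', '!' or '?' (an assumption for this sketch).
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    tokenized = [re.findall(r"[A-Za-z]+", s) for s in sentences]

    # Vocabulary in order of first appearance, so rows follow the reading order.
    vocab = list(dict.fromkeys(w for sent in tokenized for w in sent))
    index = {w: i for i, w in enumerate(vocab)}

    matrix = [[0] * len(vocab) for _ in vocab]
    for sent in tokenized:
        # Every pair of words in a sentence co-occurs (a word also with itself,
        # which is why the diagonal is 1 in the table above).
        for a, b in product(set(sent), repeat=2):
            matrix[index[a]][index[b]] = 1
    return vocab, matrix


vocab, matrix = sentence_cooccurrence("Roses are red. Sky is blue.")
for word, row in zip(vocab, matrix):
    print(f"{word:<6}", row)
```

Changing `= 1` to `+= 1` would turn the indicator into a count of the sentences in which each pair appears together.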

What are they used for in NLP?

The most obvious use of these matrices is to provide links between notions. Suppose you're working on product reviews, and suppose for simplicity that each review is composed only of short sentences. You'll have something like this:

ProductX is amazing.

I hate productY.

Representing these reviews as one co-occurrence matrix will enable you to associate products with appreciations.
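A small sketch of that idea in Python, with products as rows and opinion words as columns. The lists `products` and `opinion_words` are hypothetical and hand-picked for this example; in practice they would come from a lexicon or a part-of-speech tagger.

```python
import re

reviews = ["ProductX is amazing.", "I hate productY."]

# Hypothetical entity lists chosen by hand for this illustration.
products = ["productx", "producty"]   # row entities
opinion_words = ["amazing", "hate"]   # column entities

# matrix[product][opinion] = number of reviews containing both words
matrix = {p: {o: 0 for o in opinion_words} for p in products}

for review in reviews:
    # Lowercase so that "ProductX" and "productY" match the entity lists.
    words = set(re.findall(r"[a-z]+", review.lower()))
    for p in products:
        for o in opinion_words:
            if p in words and o in words:
                matrix[p][o] += 1

for p in products:
    print(p, matrix[p])
```

Here each review is treated as one context, which works because the reviews are single short sentences.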

merours
  • Your example is a little odd. If the columns represent sentences, then you should have only 2 columns and the table will effectively be an index. What you have there is a term-term matrix, where the context of a word only spans the sentence it occurs in. – mbatchkarov Jun 06 '14 at 09:19
  • Sorry I wasn't clear enough. Columns are words, and so are rows. Sentences don't appear in the matrix, they are just used implicitly. – merours Jun 06 '14 at 09:25
  • You might have more counts when you have more sentences. – Daniel Jun 06 '14 at 20:38
  • A better example might have been the good ol' "Roses are red, Violets are blue" to illustrate that the token "are" appears twice, but exists just once in the column/row headings. – michael Sep 26 '17 at 01:34
  • I've been very confused on how co-occurrence relates to covariance and correlation. Can anyone help clarify? – Jay Shin Mar 28 '18 at 09:09
  • @JayShin this is an interesting question on its own, you should ask it in its own thread :) – merours Apr 04 '18 at 19:28
  • @JayShin co-occurrence is mainly an NLP concept (there are generalizations, but that is the main usage), whereas covariance is a very general stats concept describing how two random variables vary with respect to one another. The correlation coefficient can be defined in terms of the covariance operator, so I'd suggest reading up on covariance first; you will then be better equipped to understand correlation coefficients and the various forms of correlation (such as Pearson or Spearman). I would not conflate correlation with co-occurrence. – Blake Jun 06 '18 at 19:31
14

The co-occurrence matrix indicates how many times the row word (e.g. 'digital') appears in the same context as the column word (e.g. 'pie'). Depending on the application, the context is a sentence or a window of ±4 words.

The entry '5' in the following table, for example, means that we had 5 sentences in our text where 'digital' co-occurred with 'pie'.

[Image: co-occurrence table]

These sentences could have been:

  • I love a digital pie.
  • What's digital is often a pie.
  • May I have some digital pie?
  • Digital world necessitates pie-eating.
  • There's something digital about this pie.

Note that with a symmetric context, such as a sentence or a ±4-word window, the co-occurrence matrix is symmetric: the entry with the row word 'pie' and the column word 'digital' will be 5 as well (as these words co-occur in the very same sentences!).
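To make the count concrete, here is a minimal Python sketch that counts sentence-level co-occurrences over the five example sentences. The function name `cooccurrence_counts` and the tokenization rule (lowercase, alphabetic runs only, so that "pie-eating" yields "pie" and "eating") are assumptions made for this illustration.

```python
import re
from collections import Counter
from itertools import combinations

sentences = [
    "I love a digital pie.",
    "What's digital is often a pie.",
    "May I have some digital pie?",
    "Digital world necessitates pie-eating.",
    "There's something digital about this pie.",
]


def cooccurrence_counts(sentences):
    """For each ordered word pair, count the sentences containing both words."""
    counts = Counter()
    for sentence in sentences:
        # Assumed tokenization: lowercase, keep runs of letters only.
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for a, b in combinations(sorted(words), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # store both orders, so the matrix is symmetric
    return counts


counts = cooccurrence_counts(sentences)
print(counts[("digital", "pie")], counts[("pie", "digital")])  # prints: 5 5
```

A `Counter` keyed by word pairs acts as a sparse matrix here: pairs that never co-occur are simply absent and read as 0.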

Evgenia Karunus