
I have the following sample dataframe:

No  category    problem_definition_stopwords
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

The 'problem_definition_stopwords' field has already been tokenized, with stop words removed.

I want to extract n-grams from the 'problem_definition_stopwords' field and find the ones that have the highest pointwise mutual information (PMI).

Essentially, I want to find the words that co-occur much more often than I would expect them to by chance.
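For reference, my understanding is that the PMI of a bigram (x, y) is log2( p(x, y) / (p(x) * p(y)) ). A toy hand calculation with made-up counts (not my real data):

import math

# PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
N = 100          # total tokens in the corpus (made up)
n_xy = 3         # occurrences of the bigram (x, y)
n_x, n_y = 5, 6  # individual occurrences of x and y

pmi = math.log2((n_xy / N) / ((n_x / N) * (n_y / N)))
print(pmi)       # log2(10) ~= 3.32: (x, y) co-occurs 10x more than chance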

I tried the following code:

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# errored out here 
finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words(df['problem_definition_stopwords']))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3) 

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10) 

The error I received was on the third chunk of code, the from_words call: TypeError: join() argument must be str or bytes, not 'list'

Edit: a more portable format for the DataFrame:

>>> df.columns
Index(['No', 'category', 'problem_definition_stopwords'], dtype='object')
>>> df.to_dict()
{'No': {0: 175, 1: 211, 2: 912, 3: 572}, 'category': {0: 2521, 1: 1438, 2: 2698, 3: 2521}, 'problem_definition_stopwords': {0: ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'], 1: ['galley', 'work', 'table', 'stuck'], 2: ['cloth', 'stuck'], 3: ['stuck', 'coffee']}}
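
To reproduce it locally, the DataFrame can be rebuilt straight from that dict:

import pandas as pd

df = pd.DataFrame({
    'No': {0: 175, 1: 211, 2: 912, 3: 572},
    'category': {0: 2521, 1: 1438, 2: 2698, 3: 2521},
    'problem_definition_stopwords': {
        0: ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        1: ['galley', 'work', 'table', 'stuck'],
        2: ['cloth', 'stuck'],
        3: ['stuck', 'coffee'],
    },
})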
It doesn't look like you're using `nltk.corpus.genesis.words` in the correct way. Look at `help(nltk.corpus.genesis.words)`: that call expects filenames, not an iterable (a Series, in this case) of lists of strings. Would something like this work: `finder = BigramCollocationFinder.from_words(df['problem_definition_stopwords'].apply(lambda x: ' '.join(x)).values)`? It "runs" without error on my machine, but I'm not sure if that's the output you're looking for. – blacksite Nov 30 '18 at 15:40

1 Answer


It doesn't look like you're using the from_words call in the right way. Look at help(nltk.corpus.genesis.words):

Help on method words in module nltk.corpus.reader.plaintext:

words(fileids=None) method of nltk.corpus.reader.plaintext.PlaintextCorpusReader instance
    :return: the given file(s) as a list of words
        and punctuation symbols.
    :rtype: list(str)
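
For contrast, a minimal sketch of how that method is normally called, with a fileid from the genesis corpus itself (assuming the corpus has been downloaded):

import nltk
# nltk.download('genesis')  # one-time download, if needed

# words() resolves its argument as corpus fileids, which is why passing
# a Series of token lists blows up inside a path join
tokens = nltk.corpus.genesis.words('english-kjv.txt')
print(tokens[:5])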

Is this what you're looking for? Since you've already represented your documents as lists of strings, which plays nicely with NLTK in my experience, I think you can use the from_documents method:

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# from_documents takes an iterable of token lists directly
finder = BigramCollocationFinder.from_documents(
    df['problem_definition_stopwords']
)

# keep only bigrams that appear at least this many times
# Note: I lowered the threshold from 3 to 1 since the corpus you
# provided is very small and it'll be tough to find repeat ngrams
finder.apply_freq_filter(1)

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)

[('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]
Ah, that worked, thank you! Is there a way to adjust the code so that I get trigrams or 4-grams, e.g. [('word', 'word', 'word')]? I tried changing bigram_measures to trigram_measures, but that still gave back two words. – PineNuts0 Nov 30 '18 at 20:31

Also, I would like to get the frequency out as well, i.e. how many times ('brewing', 'properly') occurred in that column. – PineNuts0 Nov 30 '18 at 20:39
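
For both follow-ups, a sketch assuming the same df as above: the finder class (not just the measures object) determines the ngram size, and every finder keeps a FreqDist of raw ngram counts in its ngram_fd attribute. Newer NLTK versions also ship a QuadgramCollocationFinder for 4-grams.

from nltk.collocations import TrigramCollocationFinder

trigram_measures = nltk.collocations.TrigramAssocMeasures()

# changing only the measures isn't enough; use the trigram finder
tri_finder = TrigramCollocationFinder.from_documents(
    df['problem_definition_stopwords']
)
tri_finder.apply_freq_filter(1)
print(tri_finder.nbest(trigram_measures.pmi, 10))

# look up how many times a given bigram occurred in the column
print(finder.ngram_fd[('brewing', 'properly')])  # -> 1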