
I know it is possible to find bigrams which have a particular word from the example in the link below:

import nltk
from nltk.collocations import BigramCollocationFinder

finder = BigramCollocationFinder.from_words(text.split())
word_filter = lambda w1, w2: "man" not in (w1, w2)
finder.apply_ngram_filter(word_filter)

bigram_measures = nltk.collocations.BigramAssocMeasures()
raw_freq_ranking = finder.nbest(bigram_measures.raw_freq, 10)  # top 10

nltk: how to get bigrams containing a specific word

But I am not sure how this can be applied if I need bigrams where both words are pre-defined.

Example:

My sentence: "hello, yesterday I have seen a man walking. On the other side there was another man yelling: 'who are you, man?'"

Given a list: ["yesterday", "other", "I", "side"], how can I get a list of bigrams containing the given words, i.e.: [("yesterday", "I"), ("other", "side")]?

Steve

2 Answers


What you want is probably a word_filter function that returns False only if all the words in a particular bigram are part of the list:

def word_filter(x, y):
    if x in lst and y in lst:
        return False
    return True

where lst = ["yesterday", "I", "other", "side"]

Note that this function accesses lst from the outer scope - which is a dangerous thing - so make sure you don't make any changes to lst within the word_filter function.
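Putting it together with the finder from the question, a minimal runnable sketch (the sentence and word list come from the question; note that text.split() leaves punctuation attached to the tokens):

```python
import nltk
from nltk.collocations import BigramCollocationFinder

text = ("hello, yesterday I have seen a man walking. On the other side "
        "there was another man yelling: who are you, man?")
lst = ["yesterday", "I", "other", "side"]

def word_filter(x, y):
    # Returning True means "remove this bigram":
    # keep it only if both words are in lst
    return not (x in lst and y in lst)

finder = BigramCollocationFinder.from_words(text.split())
finder.apply_ngram_filter(word_filter)

bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.raw_freq, 10))
```

apply_ngram_filter removes every bigram for which the function returns True, so only bigrams whose two words are both in lst survive - here ("yesterday", "I") and ("other", "side").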

Mortz
  • Thank you for your answer @Mortz. I do not want to find all combinations of bigrams from the list. To be more precise, I was looking for a way to find all bigrams in a text that contain both words in the given list. – Steve Dec 18 '18 at 20:51
  • When you say "both words", do you mean you also want to consider, say, `("yesterday", "side")` as a valid bigram to be found? – Mortz Dec 19 '18 at 05:36
  • Yes, exactly. That is what I meant. – Steve Dec 19 '18 at 07:10
  • Yes, instead of using a function I created a list of tuples with all bigrams and looped through them, checking each time whether both words of the bigram are in the list of words and removing the invalid ones. – Steve Dec 20 '18 at 10:11
  • The only problem with creating a list of bigrams from your search list is that once your search list starts getting larger, it gets computationally expensive to create a bigrams list. It's always best to use built-in functions. – Mortz Dec 20 '18 at 10:35
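The loop-over-bigrams approach described in the comments can be sketched in plain Python, without NLTK (the tokenization here is a bare split(), which keeps punctuation attached - an assumption):

```python
text = ("hello, yesterday I have seen a man walking. On the other side "
        "there was another man yelling: who are you, man?")
words = {"yesterday", "I", "other", "side"}  # a set makes membership checks O(1)

tokens = text.split()
# Pair each token with its successor to get all adjacent bigrams,
# then keep only those whose words are both in the search set
matches = [(w1, w2) for w1, w2 in zip(tokens, tokens[1:])
           if w1 in words and w2 in words]
print(matches)  # [('yesterday', 'I'), ('other', 'side')]
```

Using a set rather than a list for the search words addresses the performance concern above: each membership check is constant time regardless of how large the word list grows.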

First, you can create all possible bigrams from your vocabulary and feed them as the vocabulary of a CountVectorizer, which can transform your given text into bigram counts.

Then, you filter the generated bigrams based on the counts given by the CountVectorizer.

Note: I have changed the token pattern to account for single characters as well. By default, CountVectorizer skips single-character tokens.

from sklearn.feature_extraction.text import CountVectorizer
import itertools

corpus = ["hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"]
unigrams = ["yesterday", "other", "I", "side"]
# Build every candidate bigram from the word list
bi_grams = [' '.join(bi_gram).lower() for bi_gram in itertools.combinations(unigrams, 2)]
vectorizer = CountVectorizer(vocabulary=bi_grams, ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)
# Keep only the candidate bigrams that actually occur in the corpus
print([word for count, word in zip(X.sum(0).tolist()[0], vectorizer.get_feature_names_out()) if count])

output:

['yesterday i', 'other side']

This approach works better when you have many documents and few words in the vocabulary. If it's the other way around, you can find all the bigrams in the document first and then filter them using your vocabulary.

Venkatachalam