How to replace bigrams in place using NLTK?

Question

Say I have a list of tuples, top_n, of the top n most common bigrams found in a corpus of text:

import nltk
from nltk import bigrams
from nltk import FreqDist

bi_grams = bigrams(text) # text is a list of strings (tokens)
fdistBigram = FreqDist(bi_grams)

n = 300
top_n= [list(t) for t in zip(*fdistBigram.most_common(n))][0]; top_n
>>> [('let', 'us'),
    ('us', 'know'),
    ('as', 'possible')
    ....
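
The `zip(*...)` expression above works, but the bigram tuples can also be read directly from the `(bigram, count)` pairs that `most_common` returns. A minimal sketch, assuming a small made-up token list: `collections.Counter` stands in for `FreqDist` (which subclasses `Counter`) and `zip(text, text[1:])` stands in for `nltk.bigrams`, so the snippet runs without NLTK installed.

```python
from collections import Counter

# Hypothetical toy corpus; in the question, `text` is the full token list.
text = ['let', 'us', 'know', 'let', 'us', 'go', 'as', 'soon', 'as', 'possible']

# Adjacent token pairs, the same pairs nltk.bigrams(text) would yield.
bigram_counts = Counter(zip(text, text[1:]))

n = 3
# most_common(n) returns (bigram, count) pairs; keep only the bigram.
top_n = [bigram for bigram, _count in bigram_counts.most_common(n)]
print(top_n[0])  # ('let', 'us'), the only bigram that occurs twice
```

With the real `fdistBigram`, the same list comprehension should apply unchanged to `fdistBigram.most_common(n)`.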

Now I want to replace, in place, every pair of adjacent words that forms a bigram in top_n with the concatenation of the two words. For example, say we have a new variable query which is a list of strings:

query = ['please','let','us','know','as','soon','as','possible']

would become

['please','letus', 'usknow', 'as', 'soon', 'aspossible']

after the desired operation. More explicitly, I want to walk over query and check whether the pair (query[i], query[i+1]) is in top_n; if it is, replace query[i] and query[i+1] with a single concatenated token, i.e. (query[i], query[i+1]) -> query[i] + query[i+1].

Is there some way to do this using NLTK, or what would be the best way to do this if looping over each word in query is necessary?

I have the feeling your input is wrong: the query is not a list of bigrams. – Arne, Dec 20 '17 at 13:20
If not, `[token_1+token_2 for token_1, token_2 in zip(query[:-1], [""]+query[2:])]` – Arne, Dec 20 '17 at 13:23
No wait, your output includes single words other than the first one. What exactly do you want? That is not a list of concatenated bigrams. – Arne, Dec 20 '17 at 13:28
Why does a query of 8 words return a list of only 6? And what do you mean by "in place"? Your title seems a bit misleading. – Tai, Dec 20 '17 at 14:22

Arne · Accepted Answer · 2017-12-22T11:36:57.083

2

Given your code and query, the following greedily replaces each pair of adjacent words with their concatenation whenever the pair is in top_n:

lookup = set(top_n)  # {('let', 'us'), ('as', 'soon')}
query = ['please', 'let', 'us', 'know', 'as', 'soon', 'as', 'possible']
answer = []
q_iter = iter(range(len(query)))
for idx in q_iter:
    answer.append(query[idx])
    if idx < (len(query) - 1) and (query[idx], query[idx+1]) in lookup:
        answer[-1] += query[idx+1]
        next(q_iter)
        # if you don't want to skip over consumed 
        # second bi-gram elements and keep 
        # len(query) == len(answer), don't advance 
        # the iterator here, which also means you
        # don't have to create the iterator in outer scope

print(answer)

Results in (for example):

>> ['please', 'letus', 'know', 'assoon', 'as', 'possible']
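
The same greedy, left-to-right merge can be wrapped in a reusable function that uses an explicit index instead of advancing an iterator by hand. This is a sketch rather than the answer's exact code: the name `merge_bigrams` is made up for illustration, and `lookup` is assumed to hold the three bigrams shown in the question.

```python
def merge_bigrams(tokens, bigram_set):
    """Greedily merge adjacent token pairs found in bigram_set, left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])  # has length 1 at the last token
        if len(pair) == 2 and pair in bigram_set:
            merged.append(pair[0] + pair[1])
            i += 2  # both tokens were consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Assumed lookup set, built from the bigrams listed in the question.
lookup = {('let', 'us'), ('us', 'know'), ('as', 'possible')}
query = ['please', 'let', 'us', 'know', 'as', 'soon', 'as', 'possible']

print(merge_bigrams(query, lookup))
# ['please', 'letus', 'know', 'as', 'soon', 'aspossible']
print(merge_bigrams(['as', 'possible', 'as', 'possible'], lookup))
# ['aspossible', 'aspossible']
```

Because a merged pair consumes both tokens, `('us', 'know')` is never matched once `'us'` has been absorbed into `'letus'`. The function returns a new list; assigning with `query[:] = merge_bigrams(query, lookup)` would update the original list object.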

edited Dec 22 '17 at 11:36

answered Dec 20 '17 at 13:54


I appreciate the answer, but the problem with it is that you don't use the variable `top_n` when replacing words in `query`. I want to search every element of the variable `query` and check if the ith and (i+1)th element are in `top_n`; if they are, then replace `query[i]` and `query[i+1]` with a single concatenated bigram i.e `(query[i], query[i+1]) -> query[i] + query[i+1]`. Let me edit my question to make it more explicit. – PyRsquared Dec 20 '17 at 14:59
All right, I think I understand your question now and will rephrase my answer. But jeez, deciphering that was hard work. – Arne Dec 20 '17 at 15:12
OK, I just noticed I am still wrong. What happens if three consecutive words are in the top_n list? – Arne Dec 20 '17 at 15:24
If 3 or more consecutive words that are in `top_n` show up, I want this result: `query=['as','possible','as','possible'] -> ['aspossible','aspossible']`, if that makes sense. – PyRsquared Dec 20 '17 at 15:29
So close, but I get the error `TypeError: 'range' object is not an iterator` when calling `next`. How would this work in Python 3.6? I tried `next(iter(q_iter))`, but then I get `IndexError: list index out of range`. – PyRsquared Dec 22 '17 at 10:40
Debugging that on a phone was hard, but it should work now. `q_iter` was indeed not a proper iterator; it needs to be turned into one when it is created. The `IndexError` is avoided by skipping the bi-gram check for the last entry. – Arne Dec 22 '17 at 11:38
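
The `TypeError` discussed in these comments comes from the fact that, in Python 3, `range` is an iterable but not an iterator, so `next` cannot be called on it directly. A small sketch of the difference, with illustrative variable names that do not appear in the thread:

```python
numbers = range(3)        # an iterable, but not an iterator

try:
    next(numbers)
except TypeError as err:
    print(err)            # 'range' object is not an iterator

it = iter(numbers)        # a real iterator over the same values
seen = []
for idx in it:
    seen.append(idx)
    if idx == 0:
        # Consume the following index so the loop skips it; the default
        # argument avoids StopIteration if the iterator is exhausted.
        next(it, None)

print(seen)               # [0, 2]
```

Calling `next(iter(q_iter))` inside the loop builds a fresh iterator each time and never advances the one the `for` loop is using, so nothing is skipped.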

score 0 · Answer 2 · answered Dec 22 '17 at 12:00

Alternative answer:

from gensim.models.phrases import Phraser
from gensim.models import Phrases
phrases = Phrases(text, min_count=1500, threshold=0.01)
bigram = Phraser(phrases)
bigram[query]
>>> ['please', 'let_us', 'know', 'as', 'soon', 'as', 'possible']

Not exactly the output desired in the question, but it works as an alternative. The min_count and threshold parameters strongly influence the output. Thanks to this question here.
