
Is it possible to get concordance for a phrase in NLTK?

import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_loc = "c://temp//text//"
files = r".*\.txt"
read_corpus = PlaintextCorpusReader(corpus_loc, files)
corpus = nltk.Text(read_corpus.words())
test = nltk.TextCollection(corpus_loc)

corpus.concordance("claim")

For example, the above returns:

on okay okay okay i can give you the claim number and my information and
 decide on the shop okay okay so the claim number is xxxx - xx - xxxx got

Now if I try corpus.concordance("claim number"), it does not work. I do have code that does this using just the .partition() method and some further processing on top of it, but I'm wondering whether it's possible to do the same using concordance.
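(For context, one way such a partition-based lookup might look; this is only a rough sketch, not the actual code referred to above, and phrase_contexts is an invented name.)

def phrase_contexts(raw_text, phrase, width=40):
    # Walk through raw_text with str.partition, collecting a window of
    # characters on either side of every occurrence of the phrase
    contexts = []
    rest = raw_text
    offset = 0
    while True:
        before, match, after = rest.partition(phrase)
        if not match:
            break
        start = offset + len(before)
        contexts.append(raw_text[max(start - width, 0):start + len(phrase) + width])
        offset = start + len(phrase)
        rest = after
    return contexts

phrase_contexts(read_corpus.raw(), "claim number")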

Naresh MG
    NLTK.text.concordance seems to only take a single word. However, an option would be to replace 'claim number' by 'claim_number' in both texts and get a concordance for 'claim_number'. – Justin D. Nov 20 '15 at 00:31
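A minimal sketch of that suggestion, assuming the read_corpus reader from the question (joined_text is just an illustrative name, and word_tokenize should keep claim_number together as a single token):

raw = read_corpus.raw().replace("claim number", "claim_number")
joined_text = nltk.Text(nltk.word_tokenize(raw))
# "claim_number" is now a single token, so the ordinary concordance works
joined_text.concordance("claim_number")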

3 Answers


According to this issue, it is not (yet) possible to search for multiple words with the concordance() function.

b3000

If you read the discussion under the very issue that @b3000 dug up, you'll see that, strangely enough, multi-word concordance is in fact available, but only in the graphical concordance tool, which you can start up like this:

>>> from nltk.app import concordance
>>> concordance()
alexis

I munged together this solution...

import nltk

def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    #concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/

    phraseList=phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())

    #Find the offset for each token in the phrase
    offsets=[c.offsets(x) for x in phraseList]
    offsets_norm=[]
    #For each token in the phraselist, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x-i for x in offsets[i]])
    #We have found the offset of a phrase if the rebased values intersect
    #--
    # http://stackoverflow.com/a/3852792/454773
    #the intersection method takes an arbitrary amount of arguments
    #result = set(d[0]).intersection(*d[1:])
    #--
    intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])

    #Clamp the window start at 0 and slice out the tokens around each match
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + len(phraseList) + right_margin]
                       for offset in intersects]

    outputs=[''.join([x+' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)

n_concordance_tokenised(text1,'monstrous size')
>> [u'one was of a most monstrous size . ... This came towards ',
    u'; for Whales of a monstrous size are oftentimes cast up dead ']
psychemedia