
Is it possible to get concordance for a phrase in NLTK?

import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_loc = "c://temp//text//"
files = r".*\.txt"
read_corpus = PlaintextCorpusReader(corpus_loc, files)
corpus = nltk.Text(read_corpus.words())
test = nltk.TextCollection(corpus_loc)

corpus.concordance("claim")

For example, the above returns:

on okay okay okay i can give you the claim number and my information and
 decide on the shop okay okay so the claim number is xxxx - xx - xxxx got

Now if I try corpus.concordance("claim number"), it does not work. I do have code that does this using just the .partition() method and some further processing on top of it, but I'm wondering whether it's possible to do the same using concordance.
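(For context, one way such a partition-based lookup might look; this is only a rough sketch, not the actual code referred to above, and phrase_contexts is an invented name.)

def phrase_contexts(raw_text, phrase, width=40):
    # Walk through raw_text with str.partition, collecting a window of
    # characters on either side of every occurrence of the phrase
    contexts = []
    rest = raw_text
    offset = 0
    while True:
        before, match, after = rest.partition(phrase)
        if not match:
            break
        start = offset + len(before)
        contexts.append(raw_text[max(start - width, 0):start + len(phrase) + width])
        offset = start + len(phrase)
        rest = after
    return contexts

phrase_contexts(read_corpus.raw(), "claim number")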

Naresh MG
    NLTK.text.concordance seems to only take a single word. However, an option would be to replace 'claim number' by 'claim_number' in both texts and get a concordance for 'claim_number'. – Justin D. Nov 20 '15 at 00:31
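A minimal sketch of that suggestion, assuming the read_corpus reader from the question (joined_text is just an illustrative name, and word_tokenize should keep claim_number together as a single token):

raw = read_corpus.raw().replace("claim number", "claim_number")
joined_text = nltk.Text(nltk.word_tokenize(raw))
# "claim_number" is now a single token, so the ordinary concordance works
joined_text.concordance("claim_number")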

3 Answers


According to this issue, it is not (yet) possible to search for multiple words with the concordance() function.

b3000

If you read the discussion under the very issue that @b3000 dug up, you'll see that, strangely enough, multi-word concordance is in fact available, but only in the graphical concordance tool, which you can start up like this:

>>> from nltk.app import concordance
>>> concordance()
alexis

I munged together this solution...

import nltk

def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    #concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/

    phraseList=phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())

    #Find the offset for each token in the phrase
    offsets=[c.offsets(x) for x in phraseList]
    offsets_norm=[]
    #For each token in the phraselist, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x-i for x in offsets[i]])
    #We have found the offset of a phrase if the rebased values intersect
    #--
    # http://stackoverflow.com/a/3852792/454773
    #the intersection method takes an arbitrary amount of arguments
    #result = set(d[0]).intersection(*d[1:])
    #--
    intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])

    #Clamp the window start at 0 and slice out the tokens around each match
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + len(phraseList) + right_margin]
                       for offset in intersects]

    outputs=[''.join([x+' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)

n_concordance_tokenised(text1,'monstrous size')
>> [u'one was of a most monstrous size . ... This came towards ',
    u'; for Whales of a monstrous size are oftentimes cast up dead ']
psychemedia