I've been reading the NLTK documentation recently, and I don't understand the following code.

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains(%s)' % word.lower()] = True
    return features

This is a feature extractor for NaiveBayesClassifier, but what does

features['contains(%s)' % word.lower()] = True

mean?

I think this line of code is a way to build a dict, but I have no idea how it works.

Thanks

alvas
Wang Paul

2 Answers

In this code:

>>> import nltk
>>> def word_features(sentence):
...     features = {}
...     for word in nltk.word_tokenize(sentence):
...         features['contains(%s)' % word.lower()] = True
...     return features
...
>>> sent = 'This a foobar word extractor function'
>>> word_features(sent)
{'contains(a)': True, 'contains(word)': True, 'contains(this)': True, 'contains(function)': True, 'contains(extractor)': True, 'contains(foobar)': True}

This line populates the features dictionary:

features['contains(%s)' % word.lower()] = True

Here's a simple example of a dictionary in Python (see https://docs.python.org/2/tutorial/datastructures.html#dictionaries for details):

>>> adict = {}
>>> adict['key'] = 'value'
>>> adict['key']
'value'
>>> adict['apple'] = 'red'
>>> adict['apple']
'red'
>>> adict
{'apple': 'red', 'key': 'value'}

And word.lower() lowercases a string, e.g.

>>> s = 'Apple'
>>> s.lower()
'apple'
>>> s = 'APPLE'
>>> s.lower()
'apple'
>>> s = 'AppLe'
>>> s.lower()
'apple'

And 'contains(%s)' % word builds a string made of contains(, a %s placeholder, and a closing ). The % operator then replaces the placeholder with the value given after the string, e.g.

>>> a = 'apple'
>>> o = 'orange'
>>> '%s' % a
'apple'
>>> '%s and' % a
'apple and'
>>> '%s and %s' % (a,o)
'apple and orange'

The % operator is similar to the str.format() method, e.g.

>>> a = 'apple'
>>> o = 'orange'
>>> '%s and %s' % (a,o)
'apple and orange'
>>> '{} and {}'.format(a,o)
'apple and orange'
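
On Python 3.6 and later, an f-string gives the same result. This is only a minimal sketch that reuses the variable names a and o from the sessions above; the other variable names are made up for illustration:

```python
a = 'apple'
o = 'orange'

# Three ways to build the same string.
percent_style = '%s and %s' % (a, o)      # printf-style % operator
format_style = '{} and {}'.format(a, o)   # str.format() method
fstring_style = f'{a} and {o}'            # f-string (Python 3.6+)

print(percent_style)
print(format_style)
print(fstring_style)

# The same idea applied to a feature key.
feature_key = f'contains({a})'
print(feature_key)
```

All three styles substitute the values into the placeholders; the original code simply uses the oldest of them.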

So 'contains(%s)' % word produces a string like this:

>>> 'contains(%s)' % a
'contains(apple)'

And when you use that string as a dictionary key, the dictionary looks like this:

>>> adict = {}
>>> key1 = 'contains(%s)' % a
>>> value1 = True
>>> adict[key1] = value1
>>> adict
{'contains(apple)': True}
>>> key2 = 'contains(%s)' % o
>>> value2 = False
>>> adict[key2] = value2
>>> adict
{'contains(orange)': False, 'contains(apple)': True}
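
Putting the pieces together, here is a minimal sketch of the whole extractor that runs without NLTK. It assumes that str.split() is an acceptable stand-in for nltk.word_tokenize (the real tokenizer also separates punctuation), and the name simple_word_features is made up for this illustration:

```python
def simple_word_features(sentence):
    """Map each lowercased token to True under a 'contains(<token>)' key."""
    features = {}
    # Assumption: whitespace splitting replaces nltk.word_tokenize here.
    for word in sentence.split():
        key = 'contains(%s)' % word.lower()  # e.g. 'contains(apple)'
        features[key] = True                 # add the key, or overwrite it
    return features


feats = simple_word_features('Apple apple ORANGE')
print(feats)
```

Because every token is lowercased before the key is built, 'Apple' and 'apple' end up under the same key, so the resulting dictionary has two entries rather than three.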

For more information, see

alvas

Say word = 'ABCxyz'.

word.lower() ---> converts the string to lower case, so it returns 'abcxyz'

'contains(%s)' % word.lower() ---> formats the string, replacing %s with the value of word.lower(), and returns 'contains(abcxyz)'

features['contains(%s)' % word.lower()] = True ---> creates a key-value pair in the features dictionary, with 'contains(abcxyz)' as the key and True as the value

Thus,

features = {}
features['contains(%s)' % word.lower()] = True

would create

features = {'contains(abcxyz)':True}
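
NLTK's NaiveBayesClassifier is trained on a list of (feature dict, label) pairs, which is why the extractor returns a dict. The following is a rough sketch of building such a list. The posts, the labels, and the name contains_features are invented for illustration, and str.split() is assumed as a stand-in for nltk.word_tokenize so the example runs without NLTK:

```python
def contains_features(post):
    """Build the 'contains(<word>)' feature dict for one post."""
    features = {}
    for word in post.split():  # assumption: stand-in for nltk.word_tokenize
        features['contains(%s)' % word.lower()] = True
    return features


# Hypothetical posts paired with hypothetical labels.
labeled_posts = [
    ('Hello there', 'Greet'),
    ('Where are you', 'whQuestion'),
]

# One (feature dict, label) pair per post.
featuresets = [(contains_features(text), label) for text, label in labeled_posts]

for feats, label in featuresets:
    print(label, feats)
```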
Ashoka Lella
  • there's a minor mistake in your final feature dictionary. should be `features{contains('abcxyz'): True)`. – alvas Apr 11 '15 at 11:59