
This Python code creates features that reflect whether a given keyword is present in a given tweet.

#get feature list stored in a file (for reuse)
featureList = getFeatureList('data/sampleTweetFeatureList.txt')

#start extract_features
def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
#end

The output looks like this:

{
    'contains(arm)': True,             #notice this
    'contains(articles)': False,
    'contains(attended)': False,
    'contains(australian)': False,
    'contains(awfully)': False,
    'contains(bloodwork)': True,       #notice this
    'contains(bombs)': False,
    'contains(cici)': False,
    .....
    'contains(head)': False,
    'contains(heard)': False,
    'contains(hey)': False,
    'contains(hurts)': True,           #notice this
    .....
    'contains(irish)': False,
    'contains(jokes)': False,
    .....
    'contains(women)': False
}

Now, how do I go about building the feature vector if the feature set also includes (apart from the presence of keywords, as shown above):

  1. Word count in the given tweet
  2. Context of a special keyword like 'earthquake'. For example, the left and right words surrounding 'earthquake' in 'japan earthquake now' are 'japan' and 'now'.

Edit: What I want to figure out is how to capture this information (word count and context) so that I get the vectors required for the SVM algorithm to work. Until now, what I have is a vector in |featureList|-dimensional space. How do I extend it to include word count and context as well?
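One common way to do this (a minimal sketch, not from the thread; the feature list, tweet, and special keyword below are illustrative assumptions) is to keep the keyword-presence dimensions as 0/1 values and append extra numeric dimensions: one for the word count, and two one-hot blocks over the same vocabulary for the left and right neighbours of the special keyword:

```python
# Illustrative vocabulary and special keyword (assumptions for this sketch)
feature_list = ['arm', 'bloodwork', 'earthquake', 'hurts', 'japan', 'now']
special_keyword = 'earthquake'

def build_vector(tweet):
    words = tweet.split()
    # dimensions 0 .. len(feature_list)-1: keyword presence (0/1)
    vector = [1.0 if w in set(words) else 0.0 for w in feature_list]
    # one extra dimension: word count (often scaled/normalised for SVMs)
    vector.append(float(len(words)))
    # context of the special keyword, encoded as two one-hot blocks:
    # one for the left neighbour, one for the right neighbour
    left = [0.0] * len(feature_list)
    right = [0.0] * len(feature_list)
    if special_keyword in words:
        idx = words.index(special_keyword)
        if idx > 0 and words[idx - 1] in feature_list:
            left[feature_list.index(words[idx - 1])] = 1.0
        if idx < len(words) - 1 and words[idx + 1] in feature_list:
            right[feature_list.index(words[idx + 1])] = 1.0
    return vector + left + right

vec = build_vector('japan earthquake now')
# total length: |featureList| + 1 + 2 * |featureList|
```

The resulting vector has a fixed length regardless of the tweet, which is what an SVM needs; the word-count dimension usually benefits from scaling so it does not dominate the 0/1 dimensions.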

simplfuzz

1 Answer


Use split() to get a list of words, then len() to count the items in that list, which gives you the number of words in a sentence:

word_count = len(tweet.split())

When you need to store multiple values, such as your context, you can use tuples, a bit like this:

features['contains(%s)' % word] = (word in tweet_words, previous_word, next_word)

So that the map looks like this:

{
    'contains(arm)': (True, 'broken', 'was'),
    'contains(articles)': (False, '', ''),
    ...
}

You can then enumerate it like this:

for feature in features:
    present, previous, next_word = features[feature]
    if present:
        print previous
        print next_word

There's a catch in your original solution: you did not take duplicate words into account. A set() holds only unique elements. Use a list [] if you want to keep duplicates, or a dict {} if you want faster lookups with some structure.
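To make the duplicates point concrete (a small illustration, not from the original answer; the sample tweet is made up), compare set() with collections.Counter, which keeps the counts instead of collapsing repeats:

```python
from collections import Counter

words = 'my arm hurts my arm is broken'.split()

unique = set(words)      # duplicates collapsed to unique words
counts = Counter(words)  # duplicates counted per word

# 'my' and 'arm' each appear twice, which set() cannot tell you
```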

Using a class will let you manipulate the data more easily. Instead of a list of words, we can get extra flexibility by using a map of words to a list of their contexts. And why not throw in the location of the word within the tweet?

class Tweet(object):
    def __init__(self, tweet):
        self.text = tweet
        self.words = tweet.split()
        self.word_count = len(self.words)

        # dictionary comprehension - preliminary map of 'word' to empty list
        self.contexts = {word: [] for word in self.words}

        for idx, word in enumerate(self.words):
            self.contexts[word].append((
                idx,                                                       # position
                self.words[idx - 1] if idx > 0 else '',                    # previous word
                self.words[idx + 1] if idx < self.word_count - 1 else '')) # next word

You could then rewrite your function this way, although duplicates still aren't handled:

def extract_features(tweet_str):
    tweet = Tweet(tweet_str)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet.words)
    return features

There are so many more things you can do with it now:

# enumerate each word with its locations and contexts
# (contexts[word] is a list, since a word may appear more than once):
for word in tweet.words:
    for location, previous, next_word in tweet.contexts[word]:
        print location, previous, next_word

# get the word count:
print tweet.word_count

# how many words start with 'apple'?
print sum(1 for word in tweet.words if word.lower().startswith('apple'))

# how many times does 'apples' occur?
print len(tweet.contexts['apples'])  # KeyError if absent; also misses 'apple', 'Apples', etc.

# did he use the word 'where'?
print 'where' in tweet.words  # note: 'Where' will not match because of the capital W
Joe
  • What I want to figure out is, how to capture this information (word count and context) in a way that I get vectors required for SVM algorithm to work? Until now what I have is a vector in |featureList| dimension space. How do I extend it to include word count and context as well? – simplfuzz Apr 13 '15 at 04:07
  • Sorry, I don't know about SVM at all. – Joe Apr 13 '15 at 12:32