This Python code creates features that reflect whether or not each keyword from a predefined list is present in a given tweet.
#get feature list stored in a file (for reuse)
featureList = getFeatureList('data/sampleTweetFeatureList.txt')

#start extract_features
def extract_features(tweet):
    #'tweet' is a tokenized tweet, i.e. a list of words
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
#end
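(getFeatureList is not shown above; it just loads the keyword list from the file, roughly along these lines, assuming one keyword per line:)

#start getFeatureList (rough sketch, assumes one keyword per line in the file)
def getFeatureList(path):
    with open(path) as f:
        #strip whitespace and drop empty lines
        return [line.strip() for line in f if line.strip()]
#end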
And the output looks like this:
{
    'contains(arm)': True, #notice this
    'contains(articles)': False,
    'contains(attended)': False,
    'contains(australian)': False,
    'contains(awfully)': False,
    'contains(bloodwork)': True, #notice this
    'contains(bombs)': False,
    'contains(cici)': False,
    .....
    'contains(head)': False,
    'contains(heard)': False,
    'contains(hey)': False,
    'contains(hurts)': True, #notice this
    .....
    'contains(irish)': False,
    'contains(jokes)': False,
    .....
    'contains(women)': False
}
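(Each tweet is passed to extract_features as a list of words, i.e. already tokenized; the tweet below is just an illustrative sample that would produce a dictionary like the one above.)

#example call (hypothetical tokenized tweet)
sample_tweet = ['my', 'arm', 'hurts', 'after', 'the', 'bloodwork']
print(extract_features(sample_tweet))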
Now, how do I go about building the feature vector if the feature set also includes the following (apart from the presence of keywords as shown above)?
- The word count of the given tweet
- The context of a special keyword like 'earthquake', i.e. the words immediately to its left and right. For example, in 'japan earthquake now' the words surrounding 'earthquake' are 'japan' and 'now'.
Edit: What I want to figure out is how to capture this information (word count and context) in a way that gives me the vectors the SVM algorithm needs. So far I have a vector in a |featureList|-dimensional space; how do I extend it to include the word count and the context as well?
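For illustration, here is a rough sketch of the kind of extension I have in mind (extract_features_v2 is just a placeholder name, and encoding the context words against the same keyword list is my own guess at how to keep the vector length fixed):

#start extract_features_v2 (rough sketch, not tested)
def extract_features_v2(tweet):
    tweet_words = set(tweet)
    features = {}
    #1. keyword presence, exactly as before
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    #2. word count of the tweet as a plain numeric feature
    features['word_count'] = len(tweet)
    #3. context of the special keyword 'earthquake': encode its left and
    #   right neighbours as boolean features over the same keyword list,
    #   so every tweet yields a vector of the same length
    left, right = None, None
    if 'earthquake' in tweet:
        i = tweet.index('earthquake')
        left = tweet[i - 1] if i > 0 else None
        right = tweet[i + 1] if i < len(tweet) - 1 else None
    for word in featureList:
        features['left_of_earthquake(%s)' % word] = (word == left)
        features['right_of_earthquake(%s)' % word] = (word == right)
    return features
#end

Every tweet would then produce a dictionary with the same keys, so it could be flattened into a numeric vector of length 3*|featureList| + 1 by iterating over the keys in a fixed order and mapping True/False to 1/0 (or by feeding the dictionaries to something like scikit-learn's DictVectorizer); the word count would probably need scaling so it doesn't dominate the 0/1 features. Is that a reasonable way to get the vectors for the SVM?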