
I am trying to build a classifier that, in addition to bag of words, uses features like sentiment or a topic (an LDA result). I have a pandas DataFrame with the text and the label, and I would like to add a sentiment value (a number between -5 and 5) and the result of an LDA analysis (a string with the topic of the sentence).

I have a working bag-of-words classifier that uses CountVectorizer from sklearn and performs the classification with MultinomialNB.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.DataFrame.from_records(data=data, columns=names)
# train_test_split on a DataFrame already returns DataFrames
train_df, test_df = train_test_split(
    df,
    train_size=train_ratio,
    random_state=1337
)
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_df['text'])
test_matrix = vectorizer.transform(test_df['text'])
positive_cases_train = (train_df['label'] == 'decision')
positive_cases_test = (test_df['label'] == 'decision')
classifier = MultinomialNB()
classifier.fit(train_matrix, positive_cases_train)

The question now is: how can I introduce these other features to my classifier in addition to the bag-of-words representation?

Thanks in advance, and if you need more information I am glad to provide it.

Edit: After adding the columns as suggested by @Guiem, a new question arose regarding the weight of the new feature. This edit addresses that new question:

The shape of my train matrix is (2554, 5286). The weird thing, though, is that the shape is the same with and without the sentiment column added (maybe the column is not added properly?).

If I print the Matrix I get the following output:

  (0, 322)  0.0917594575712
  (0, 544)  0.196910480455
  (0, 556)  0.235630958238
  (0, 706)  0.137241420774
  (0, 1080) 0.211125349374
  (0, 1404) 0.216326271935
  (0, 1412) 0.191757369869
  (0, 2175) 0.128800602511
  (0, 2176) 0.271268708356
  (0, 2371) 0.123979845513
  (0, 2523) 0.406583720526
  (0, 3328) 0.278476810585
  (0, 3752) 0.203741786877
  (0, 3847) 0.301505063552
  (0, 4098) 0.213653538407
  (0, 4664) 0.0753937554096
  (0, 4676) 0.164498844366
  (0, 4738) 0.0844966331512
  (0, 4814) 0.251572721805
  (0, 5013) 0.201686066537
  (0, 5128) 0.21174469759
  (0, 5135) 0.187485844479
  (1, 291)  0.227264696182
  (1, 322)  0.0718526940442
  (1, 398)  0.118905396285
  : :
  (2553, 3165)  0.0985290985889
  (2553, 3172)  0.134514497354
  (2553, 3217)  0.0716087169489
  (2553, 3241)  0.172404983302
  (2553, 3342)  0.145912701013
  (2553, 3498)  0.149172538211
  (2553, 3772)  0.140598133976
  (2553, 4308)  0.0704700896603
  (2553, 4323)  0.0800039075449
  (2553, 4505)  0.163830579067
  (2553, 4663)  0.0513678549359
  (2553, 4664)  0.0681930862174
  (2553, 4738)  0.114639856277
  (2553, 4855)  0.140598133976
  (2553, 4942)  0.138370066422
  (2553, 4967)  0.143088901589
  (2553, 5001)  0.185244190321
  (2553, 5008)  0.0876615764151
  (2553, 5010)  0.108531807984
  (2553, 5053)  0.136354534152
  (2553, 5104)  0.0928665728295
  (2553, 5148)  0.171292088292
  (2553, 5152)  0.172404983302
  (2553, 5191)  0.104762377866
  (2553, 5265)  0.123712025565

I hope that helps a little or did you want some other information?

d.a.d.a
  • The fact you say your matrix size is the same indicates something is wrong with the adding of the feature. Are you sure you are doing the insert into the dense matrix and that you are printing the new matrix size as well? Otherwise you are right and it's really weird that the size is the same. – Guiem Bosch Feb 10 '16 at 16:10
  • Apart from that, I've been thinking about your problem lately (yes, you got me involved in it!) and I still have a "conceptual" doubt. I mean, you asked how to add new features and I came up with a possible solution. But if you are telling me that this new feature is, for example, the sentiment of the text sample, conceptually I'd tend to say this is implicit in the sample itself. So it's kind of redundant. – Guiem Bosch Feb 10 '16 at 16:18
  • Unless you perform sentiment analysis in a more semantic way so you really add new info. But if sentiment is based on word polarity (pos, neg) your BOW should collect that info in your tf-idf representation. Don't know if that makes sense to you, cheers! – Guiem Bosch Feb 10 '16 at 16:19
  • Yes, I add the new features into the dense matrix but afterwards convert it to a sparse matrix again: `dense_matrix = train_matrix.todense() np.insert(dense_matrix, dense_matrix.shape[1], train_df['sentiment'], axis=1) train_matrix = csr_matrix(dense_matrix)` – d.a.d.a Feb 10 '16 at 17:13
  • Maybe you are right about the sentiment being implicit in the BOW, but it is for my thesis and I need to run those experiments as my supervisor wants them :-) – d.a.d.a Feb 10 '16 at 17:18
  • @Guiem I am sorry, looks like I had a major fail in my program. `np.insert` returns a new matrix, but the dense matrix remains the same. After using the return value it adds the features and my accuracy (unfortunately) goes down by .2 percent. That's basically all I wanted for now. Thanks a lot for helping me that much. – d.a.d.a Feb 11 '16 at 14:54
  • No problem, I'm glad it made sense in the end (even if performance was not increased). Good luck with your thesis – Guiem Bosch Feb 11 '16 at 15:30
  • @Guiem one more question if I may ask. You said in previous comments that maybe the weight of the newly added features has to be adjusted. Is that only needed if I don't use tf-idf, or should I always do that? If so, what is the best approach? – d.a.d.a Feb 12 '16 at 11:46
  • Sorry, when I said weight I strictly meant **feature scaling**, which is considered good practice with most classifiers. Most of them use distance measures, so if a feature is not normalized it could 'weigh' more than others. So, imagine you use tf-idf and have values between 0 and 1 in all feature columns (`col1: 0.7, col2: 0.45, ...`) and then you add sentiment from 0 to 10. As its range is broader, this feature will dominate the others. Check this out, hopefully it will clarify my words: https://en.wikipedia.org/wiki/Feature_scaling – Guiem Bosch Feb 12 '16 at 16:10
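The feature scaling described in the last comment can be sketched with scikit-learn's MinMaxScaler; the toy sentiment values below are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# tf-idf columns live in [0, 1]; a raw sentiment in [0, 10] would dominate them
sentiment = np.array([[0.0], [4.0], [10.0]])

# MinMaxScaler rescales each column to [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(sentiment)
print(scaled.ravel())  # [0.  0.4 1. ]
```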

2 Answers


One option would be to just add these two new features to your CountVectorizer matrix as columns.

As you are not performing any tf-idf, your count matrix is going to be filled with integers so you could encode your new columns as int values.

You might have to try several encodings but you can start with something like:

  • sentiment [-5,...,5] transformed to [0,...,10]
  • string with topic of sentence. Just assign integers to different topics ({'unicorns':0, 'batman':1, ...}), you can keep a dictionary structure to assign integers and avoid repeating topics.
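As a sketch of both encodings on toy data (the column names 'sentiment' and 'topic' are assumptions, since the question doesn't name them):

```python
import pandas as pd

# toy frame standing in for the question's train_df
train_df = pd.DataFrame({
    'sentiment': [-5, 0, 3],
    'topic': ['unicorns', 'batman', 'unicorns'],
})

# shift sentiment from [-5, 5] to [0, 10] so all values stay non-negative
sentiment_col = train_df['sentiment'] + 5

# assign a fresh integer to each topic the first time it is seen
topic_ids = {}
topic_col = train_df['topic'].map(lambda t: topic_ids.setdefault(t, len(topic_ids)))

print(sentiment_col.tolist())  # [0, 5, 8]
print(topic_col.tolist())      # [0, 1, 0]
```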

And just in case you don't know how to add columns to your train_matrix:

dense_matrix = train_matrix.todense()  # CountVectorizer returns a sparse matrix
# np.insert returns a new matrix, so keep the return value
dense_matrix = np.insert(dense_matrix, dense_matrix.shape[1], [val1,...,valN], axis=1)

Note that the column [val1,...,valN] needs to have the same length as the number of samples you are using.
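The dense round-trip can also be skipped entirely: scipy.sparse.hstack appends the extra column while everything stays sparse. A sketch with toy stand-in data:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# stand-in for the CountVectorizer output: 3 samples x 4 vocabulary terms
train_matrix = csr_matrix(np.array([[1, 0, 2, 0],
                                    [0, 1, 0, 1],
                                    [3, 0, 0, 0]]))

# extra feature column, one value per sample (e.g. shifted sentiment)
extra = np.array([[7], [2], [10]])

# hstack keeps the result sparse; no .todense() needed
train_matrix = hstack([train_matrix, csr_matrix(extra)], format='csr')
print(train_matrix.shape)  # (3, 5)
```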

Even though it won't strictly be a bag of words anymore (because not all columns represent word frequency), adding these two columns will incorporate the extra information you want to include. And the naive Bayes classifier considers each feature to contribute independently to the probability, so we are okay here.

Update: better to use a 'one hot' encoder for categorical features (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). This way you prevent the weird behavior that comes from assigning integer values to your new features. (You can probably still use integers for sentiment, because on a sentiment scale from 0 to 10 you assume that a sample with sentiment 9 is closer to one with sentiment 10 than to one with sentiment 0.) But for categorical features you are better off with one-hot encoding. So let's say you have 3 topics: then, using the same column-adding technique, you add 3 columns instead of one, [topic1, topic2, topic3]. If a sample belongs to topic1 you encode it as [1, 0, 0]; if it is topic3, the representation is [0, 0, 1] (you mark with 1 the column that corresponds to the topic).
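A minimal sketch of that one-hot encoding, done by hand with NumPy (the three topic names are made up; sklearn's OneHotEncoder would do the same job, but its API has changed across versions):

```python
import numpy as np

# three assumed topics; each sample carries one of them
topics = ['topic1', 'topic2', 'topic3']
sample_topics = ['topic1', 'topic3', 'topic1']

# put a 1 in the column that corresponds to the sample's topic
one_hot = np.zeros((len(sample_topics), len(topics)), dtype=int)
for i, t in enumerate(sample_topics):
    one_hot[i, topics.index(t)] = 1

print(one_hot)
# [[1 0 0]
#  [0 0 1]
#  [1 0 0]]
```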

Guiem Bosch
  • Thanks a lot, that works. One question still: if I use tf-idf and have a floating-point sentiment between -1 and 1, is it still correct to just append a column to my train and test matrices? – d.a.d.a Feb 09 '16 at 11:00
  • And would I be able to use the same approach for SVC classifiers, or is that wrong (are features also independent in SVC)? – d.a.d.a Feb 09 '16 at 12:14
  • Hey, if you use tf-idf you can normalize your sentiment to the range [0,1]. And yeah, I encourage you to compare results with SVC; no need to think about independence or not, just think of them as extra features. Don't forget to use one-hot encoding for the topic, though. – Guiem Bosch Feb 09 '16 at 13:40
  • One question: when I add these I get exactly the same precision, recall and F-score as when I don't use the sentiment feature. Is there any chance this is not working as intended? It seems more than odd that all the floating-point numbers are exactly the same – d.a.d.a Feb 09 '16 at 16:12
  • Could you print, for example, two training samples after adding the sentiment, with their values for every column (feature)? How many features do you have? What I'm trying to see is whether your new feature represents almost nothing in the overall sample and thus does not change the result. If so, we could think of a way to weight that specific feature more, right? Cheers, and very interesting questions are arising from your case – Guiem Bosch Feb 09 '16 at 18:31
  • Could you help me one more time? When I use SVC(kernel='linear') and add the additional features, I get the following error when predicting: `ValueError: X.shape[1] = 6568 should be equal to 8650, the number of features at training time`. Do you know how I can fix that? – d.a.d.a Feb 18 '16 at 16:34
  • Ignore me please, I inserted wrong again -.- I am sorry – d.a.d.a Feb 18 '16 at 17:31

A less hacky way to do this is to use scikit-learn's FeatureUnion, which basically concatenates the text features with the tabular-data features.

Check out the answers to these two other SO questions:

You would then pass the output of the FeatureUnion into a classifier as part of a Pipeline.
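A sketch of that pipeline on toy data; the `ColumnSelector` helper is a made-up transformer (not part of sklearn) that pulls one DataFrame column out for each branch of the union:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: select one DataFrame column for a sub-pipeline."""
    def __init__(self, column, to_2d=False):
        self.column = column
        self.to_2d = to_2d

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        col = X[self.column]
        # numeric features need shape (n_samples, 1); text stays 1-D for CountVectorizer
        return col.to_numpy().reshape(-1, 1) if self.to_2d else col

# toy data standing in for the question's DataFrame
df = pd.DataFrame({
    'text': ['the decision was made', 'no outcome yet', 'another final decision'],
    'sentiment': [8, 2, 9],   # assumed already shifted to [0, 10]
    'label': [1, 0, 1],
})

# FeatureUnion concatenates the bag-of-words columns with the sentiment column
features = FeatureUnion([
    ('bow', Pipeline([('select', ColumnSelector('text')),
                      ('vec', CountVectorizer())])),
    ('sentiment', ColumnSelector('sentiment', to_2d=True)),
])

model = Pipeline([('features', features), ('clf', MultinomialNB())])
model.fit(df, df['label'])
preds = model.predict(df)
print(preds)
```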

louis_guitton