I am trying to build a classifier that, in addition to bag of words, uses features like sentiment or topic (an LDA result). I have a pandas DataFrame with the text and the label, and I would like to add a sentiment value (a number between -5 and 5) and the result of the LDA analysis (a string with the topic of the sentence).
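For illustration, the DataFrame I have in mind looks roughly like this (placeholder column names and values, just to show the layout):

import pandas as pd

# Hypothetical illustration of the DataFrame layout
example_df = pd.DataFrame({
    'text': ['we should merge the branch', 'nice weather today'],
    'label': ['decision', 'other'],
    'sentiment': [2.5, 0.0],                # number between -5 and 5
    'topic': ['development', 'smalltalk'],  # string from the LDA analysis
})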
I have a working bag-of-words classifier that uses CountVectorizer from sklearn and performs the classification with MultinomialNB.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Build the DataFrame and split it into train and test sets
df = pd.DataFrame.from_records(data=data, columns=names)
train_df, test_df = train_test_split(
    df,
    train_size=train_ratio,
    random_state=1337
)

# Bag of words: fit the vocabulary on the training texts only
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_df['text'])
test_matrix = vectorizer.transform(test_df['text'])

# Binary target: is the sentence labelled 'decision'?
positive_cases_train = (train_df['label'] == 'decision')
positive_cases_test = (test_df['label'] == 'decision')

classifier = MultinomialNB()
classifier.fit(train_matrix, positive_cases_train)
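For completeness, this is roughly how I evaluate it on the held-out test set:

# Predict and score on the test matrix
predictions = classifier.predict(test_matrix)
accuracy = classifier.score(test_matrix, positive_cases_test)
print(accuracy)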
Now my question: how can I introduce the other features to my classifier in addition to the bag-of-words features?
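What I am imagining is something along these lines (a rough sketch, using the hypothetical sentiment and topic columns from above; I am not sure this is the right approach):

import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Append the sentiment score as one extra column to the bag-of-words matrix.
# Note: MultinomialNB expects non-negative features, so the -5..5 score
# would probably need to be shifted or scaled first.
sentiment_col = csr_matrix(train_df['sentiment'].values.reshape(-1, 1))
train_matrix_ext = hstack([train_matrix, sentiment_col])

# The topic is a string, so encode it first, e.g. one-hot via get_dummies.
topic_cols = csr_matrix(pd.get_dummies(train_df['topic']).values.astype(float))
train_matrix_ext = hstack([train_matrix_ext, topic_cols])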
Thanks in advance, and if you need more information I am glad to provide it.
Edit: After adding the rows as suggested by @Guiem, a new question arose regarding the weight of the new feature. This edit addresses that new question:

The shape of my train matrix is (2554, 5286). The weird thing, though, is that it has this shape with and without the sentiment column added (maybe the row is not added properly?).
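For reference, this is roughly how I append the column and check the shape (one thing I still want to rule out is that I forgot to keep the result, since hstack returns a new matrix rather than modifying in place):

from scipy.sparse import csr_matrix, hstack

# Sketch of the check: after hstack the column count should grow by one
extra = csr_matrix(train_df['sentiment'].values.reshape(-1, 1))
combined = hstack([train_matrix, extra])  # must keep the returned matrix
print(train_matrix.shape)  # (2554, 5286)
print(combined.shape)      # expected: (2554, 5287)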
If I print the matrix, I get the following output:
(0, 322) 0.0917594575712
(0, 544) 0.196910480455
(0, 556) 0.235630958238
(0, 706) 0.137241420774
(0, 1080) 0.211125349374
(0, 1404) 0.216326271935
(0, 1412) 0.191757369869
(0, 2175) 0.128800602511
(0, 2176) 0.271268708356
(0, 2371) 0.123979845513
(0, 2523) 0.406583720526
(0, 3328) 0.278476810585
(0, 3752) 0.203741786877
(0, 3847) 0.301505063552
(0, 4098) 0.213653538407
(0, 4664) 0.0753937554096
(0, 4676) 0.164498844366
(0, 4738) 0.0844966331512
(0, 4814) 0.251572721805
(0, 5013) 0.201686066537
(0, 5128) 0.21174469759
(0, 5135) 0.187485844479
(1, 291) 0.227264696182
(1, 322) 0.0718526940442
(1, 398) 0.118905396285
: :
(2553, 3165) 0.0985290985889
(2553, 3172) 0.134514497354
(2553, 3217) 0.0716087169489
(2553, 3241) 0.172404983302
(2553, 3342) 0.145912701013
(2553, 3498) 0.149172538211
(2553, 3772) 0.140598133976
(2553, 4308) 0.0704700896603
(2553, 4323) 0.0800039075449
(2553, 4505) 0.163830579067
(2553, 4663) 0.0513678549359
(2553, 4664) 0.0681930862174
(2553, 4738) 0.114639856277
(2553, 4855) 0.140598133976
(2553, 4942) 0.138370066422
(2553, 4967) 0.143088901589
(2553, 5001) 0.185244190321
(2553, 5008) 0.0876615764151
(2553, 5010) 0.108531807984
(2553, 5053) 0.136354534152
(2553, 5104) 0.0928665728295
(2553, 5148) 0.171292088292
(2553, 5152) 0.172404983302
(2553, 5191) 0.104762377866
(2553, 5265) 0.123712025565
I hope that helps a little. Did you want some other information?