I am trying to implement a machine learning algorithm that will help me with two goals:
1) Classify each string in a set into one of several predetermined categories based on its content.
2) Estimate the confidence that a given string belongs in its assigned category.
An example set of strings and their categories is below:
"Damage to right rear fender" -- Problem
"Scratch. Side view mirror" -- Problem
"Next scheduled maintenance on 12/23/2016" -- Appointment
"Customer should return on 1/1/2017" -- Appointment
"Red car, Volkswagon" -- Description
"Car is dark gray with large scratch on the side" -- Description
" Do not fill the car with premium fuel" -- Instruction
"Engine should cool to <100 celcius before driving" -- Instruction
I am brand new to machine learning, so I am trying to figure out the best approach to accomplish this in Python. I have a training set of approximately 1000 strings and a test set of 5000 strings.
My first approach was a One-vs-Rest classifier using scikit-learn (credit to @Cerin and @JMaurer), but the results were not great: on manual review, only 55% of the test strings were categorized correctly. I suspect this is because the strings contain symbols and numbers that contribute to their categorization, and those may be getting lost during feature extraction.
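A minimal sketch of this kind of setup, for illustration only. It assumes scikit-learn's `TfidfVectorizer` with character n-grams (so that symbols such as "/" and "<" are kept as features) and `LogisticRegression` as the base estimator. Both choices, and the helper name `classify`, are assumptions made for the example and are not necessarily what the code credited above uses. The eight example strings stand in for the real training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy training data: the eight example strings listed above.
train_texts = [
    "Damage to right rear fender",
    "Scratch. Side view mirror",
    "Next scheduled maintenance on 12/23/2016",
    "Customer should return on 1/1/2017",
    "Red car, Volkswagon",
    "Car is dark gray with large scratch on the side",
    "Do not fill the car with premium fuel",
    "Engine should cool to <100 celcius before driving",
]
train_labels = [
    "Problem", "Problem",
    "Appointment", "Appointment",
    "Description", "Description",
    "Instruction", "Instruction",
]

# Assumption: character n-grams within word boundaries. Unlike the default
# word tokenizer, this keeps punctuation such as "/" and "<" as features.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_texts, train_labels)


def classify(text):
    """Return (predicted category, confidence) for a single string.

    The confidence is the normalized One-vs-Rest probability of the
    winning category, a rough score rather than a calibrated estimate.
    """
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    return str(model.classes_[best]), float(probabilities[best])


label, confidence = classify("Follow-up visit on 3/15/2017")
print(label, round(confidence, 2))
```

With only eight training strings the output is not meaningful; the point is the shape of the pipeline and that `predict_proba` gives a per-category score that can serve as a rough confidence.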
Can anybody with more experience comment on whether this is the right approach for the task, or whether there is a better method I could use? I am a bit in the dark and am really looking for some breadcrumbs to point me in the right direction.
Thanks.
Paul