
I have a machine learning classifier from Create ML. The model is trained with 3400 samples and is overall impressively accurate. However, every so often the model makes a wrong prediction, and I can't figure out how to feed this back into the model so that it doesn't have such high confidence in some of these wrong predictions. How can I re-train the model with this new piece of data, telling it that the sample is **not** of the classification it predicted? Should I make a new classification folder and add these wrongly classified samples to it, or is there a way during training to pass them in as a "not this classification" type so the model can learn the difference between them?

Charlie
  • Can you provide some more details? – Sachin Yadav Nov 04 '19 at 18:36
  • I am not sure what more details I can give. Basically, my ML model is making a prediction that I would like it not to make. How do I re-train a model, telling it that this one example is **not** this type of classification? – Charlie Nov 04 '19 at 21:58
  • You give a very broad description of your problem here. When you say "it's impressive how accurate it is", have you also looked at other metrics? Precision, recall, confusion matrix? What about the characteristics of your data? Is your data class-imbalanced? If your data consists of 99.9% class 0 and only 0.1% class 1, it won't be surprising if your model has 99.9% accuracy by predicting just class 0. How many classes do you actually have? – Tinu Nov 05 '19 at 08:36
  • Have you looked at the wrongly predicted data points? Maybe there is a connection between them? Maybe they're mislabeled. – Tinu Nov 05 '19 at 08:38
  • I have three classes: 1, 2, 3. Type 1 has 1578 samples, Type 2 has 1231, and Type 3 has 714 samples. Training: (Type 1) Precision 100%, Recall 99%; (Type 2) Precision 98%, Recall 100%; (Type 3) Precision 98%, Recall 97%. Validation: (Type 1) Precision 100%, Recall 99%; (Type 2) Precision 97%, Recall 100%; (Type 3) Precision 95%, Recall 93%. Testing: (Type 1) Precision 100%, Recall 97%; (Type 2) Precision 99%, Recall 100%; (Type 3) Precision 94%, Recall 99%. – Charlie Nov 05 '19 at 14:23
  • When I said it is impressive how accurate it is, that's because I have code that is also trying to figure out the classification, and I was impressed that some of the model's predictions were better than the code at picking up small differences. Ultimately, what I am trying to figure out is this: my MLModel made a prediction that I don't want it to make as one of my three specified classes. How do I train a model to recognize that that prediction is wrong and should not be classified as one of my three classes? [Type 2 99% Right](https://pastebin.com/ukFXU0VK) [Type 2 94% Wrong](https://pastebin.com/eQWP9YTZ) – Charlie Nov 05 '19 at 14:41
  • better place to ask it: https://stats.stackexchange.com/questions – PV8 Nov 08 '19 at 13:02

1 Answer


Disclaimer: So far I have not worked with Create ML. As I understand from the question, you provide your training data via a folder structure, and training and evaluation are then done by pressing a button. Correct me if I have made a bad assumption.

It would be nice to know what kind of model/architecture you are using and what your training samples look like.

To me your issue sounds like these poorly predicted samples might be underrepresented in your overall dataset. There are a few tricks you can try here:

  1. Just duplicate (copy-paste within your training sample folder) these samples for your training process, so as to double the error feedback on those particular samples.
  2. A more sophisticated approach would be to apply data augmentation strategies on those samples, and then add the augmented samples to your training data set.

Depending on your sample type, there are augmentation packages available for Python, and they are pretty easy and straightforward to use.
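
To make the second point concrete: for numeric time-series samples stored as CSV files (which is what the comments below turn out to describe), even a hand-rolled augmentation is only a few lines. The sketch below is not Create ML-specific; the folder name, noise scale, and number of copies are illustrative choices, not tuned values:

```python
import numpy as np
import pandas as pd
from pathlib import Path

def augment_csv(src: Path, dst: Path, noise_scale: float = 0.01, copies: int = 3) -> None:
    """Write `copies` noisy variants of one CSV sample into `dst`.

    Gaussian noise is scaled per column, so channels with a larger
    spread (e.g. gyroscope vs. accelerometer axes) get proportionally
    larger jitter. Assumes every column is numeric.
    """
    df = pd.read_csv(src)
    sigma = df.std() * noise_scale  # per-column noise level
    for i in range(copies):
        noisy = df + np.random.normal(0.0, sigma, size=df.shape)
        noisy.to_csv(dst / f"{src.stem}_aug{i}.csv", index=False)

# Hypothetical layout: augment every sample in one class folder, writing the
# new files back into the same folder so the next training run picks them up.
# list() materializes the file list first, so new files are not re-augmented.
class_dir = Path("data/three")
for csv_file in list(class_dir.glob("*.csv")):
    augment_csv(csv_file, class_dir)
```

This also subsumes the first trick: with the noise scale set to zero, it is plain duplication.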

mrk
  • Your assumption is correct. I have two folders. One is data, with three folders in it labeled one, two, and three; these folders contain CSV files with the data set I have collected, and the CSV files contain accelerometer and gyroscope data points. The second folder is testing data (20% of my data set), and it also has three folders, one, two, and three, each containing CSV files. [CSVFileExample](https://pastebin.com/ena3hhwv) Data augmentation seems right from googling it, but how do I do it on CSV files? – Charlie Nov 08 '19 at 15:23
  • Also, overall it seems like I should add a new classification-type folder titled four, put all the samples I don't want classified as one, two, or three in there, and then data-augment it so 20 becomes 700? – Charlie Nov 08 '19 at 15:24
  • If you now have four classes, yes you should add a fourth folder. Augmentation from 20 to 700 samples might be a far stretch though. – mrk Nov 10 '19 at 09:21
  • To answer the data augmentation question: one way to do it would be to load the entries of your CSV file into a NumPy array, apply some augmentation (noise, shifts, etc.) to the entries, and finally save them again. Just pay attention and be careful not to disturb your data to an extent where it falls outside your actual data distribution. – mrk Nov 11 '19 at 07:51
  • Yes, sorry, I went away for the weekend. Last thing: when you say noise, does that just mean randomly adding or subtracting small numbers from the values? – Charlie Nov 12 '19 at 16:10
  • Yes, that would be the simplest option. The values could also follow a distribution that you sample from during augmentation. – mrk Nov 12 '19 at 17:40
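
Putting the last few comments together, here is a minimal sketch of both perturbations mentioned (noise sampled from a distribution, plus a small shift along the time axis), assuming each sample is a (timesteps × channels) array loaded from a CSV file. The noise scale, shift range, and file names are illustrative assumptions, and header handling is kept minimal:

```python
import numpy as np

def augment(sample: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Perturb one (timesteps, channels) sensor sample.

    Combines the two ideas from the comments: additive noise sampled
    from a normal distribution, plus a small circular shift along the
    time axis. Keep both small enough that the result stays within the
    natural variation of the real recordings.
    """
    noise = rng.normal(loc=0.0, scale=0.02, size=sample.shape)
    shift = int(rng.integers(-5, 6))  # up to 5 timesteps either way
    return np.roll(sample + noise, shift, axis=0)

rng = np.random.default_rng(42)
data = np.loadtxt("sample.csv", delimiter=",", skiprows=1)  # hypothetical file; skiprows drops the header
np.savetxt("sample_aug.csv", augment(data, rng), delimiter=",")
```

Whether fixed small random offsets or distribution-sampled noise works better depends on how noisy the real sensor stream already is; a noise level well below the sensors' own measurement noise is a safe starting point.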