
I have a dataset generated by an IoT device, and I'm trying to predict '1' when a machine will break down (a rare event) and '0' when it will not. The dataset is highly imbalanced, and I'm considering using an LSTM for prediction. I'm not sure how to prepare my data for this task. Should I remove the zero values in each row, since most columns contain them? Only a few of the columns do not contain outliers. Below is an example of what the distribution of my data looks like, although it does not show everything. FYI, I have more columns that are not included in the snapshot, and about 75% of the columns in the data look like this.

[Image: snapshot of the data distribution]

Omomaxi
  • You can have a look at stratified imbalanced sampling methods. You also should decide if your focus is Precision or Recall, or a mixture of both. – Andreas Aug 25 '21 at 00:29
  • Does this answer your question? [Stratified Train/Test-split in scikit-learn](https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn) – Andreas Aug 25 '21 at 00:30
  • @Andreas Thanks for the suggestion, but it doesn't address my question. What I need is not how to split my data. What I'm confused about is mainly how to clean my data before getting to cross-validation. I stated in my question that there are many zero values in the input variables, which is why I included the distribution of my data: to give an idea of what I'm working with. Please refer to my snapshot. Thanks – Omomaxi Aug 25 '21 at 01:01

1 Answer


A common approach when dealing with imbalanced datasets is to use resampling techniques such as undersampling and oversampling. In Python, imbalanced-learn is a popular library that implements both of these methods.

Undersampling removes samples from the majority class, whereas oversampling duplicates samples from the minority class. Oversampling is generally preferred because you are not removing data. Lastly, you can use a more advanced oversampling technique called SMOTE to create new synthetic minority-class data. This generally performs best; see here for additional info.
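To make the difference between the three techniques concrete, here is a minimal sketch using only the Python standard library. It is not the imbalanced-learn implementation: the function names (`random_undersample`, `random_oversample`, `smote_like`), the toy 95/5 dataset, and the choice of `k=3` neighbours are all assumptions made for illustration, and the code assumes a binary label with at least two minority rows.

```python
import math
import random
from collections import Counter


def _minority_label(y):
    """Return the label that occurs least often."""
    counts = Counter(y)
    return min(counts, key=counts.get)


def random_undersample(X, y, rng):
    """Drop majority rows at random until both classes have the same size."""
    minority = _minority_label(y)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    keep = sorted(min_idx + rng.sample(maj_idx, len(min_idx)))
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, rng):
    """Duplicate minority rows at random until both classes have the same size."""
    minority = _minority_label(y)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(min_idx)) - len(min_idx)
    extra = [rng.choice(min_idx) for _ in range(n_needed)]
    return X + [X[i] for i in extra], y + [minority] * n_needed


def smote_like(X, y, rng, k=3):
    """SMOTE-style sketch: build each synthetic row by interpolating between
    a minority row and one of its k nearest minority neighbours."""
    minority = _minority_label(y)
    pts = [X[i] for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(pts)) - len(pts)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(pts)
        neighbours = sorted(
            (p for p in pts if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return X + synthetic, y + [minority] * n_needed


# Toy data (hypothetical): 95 "no breakdown" rows and 5 "breakdown" rows.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(95)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(5)]
y = [0] * 95 + [1] * 5

X_under, y_under = random_undersample(X, y, rng)
X_over, y_over = random_oversample(X, y, rng)
X_smote, y_smote = smote_like(X, y, rng)

print("undersampled:", Counter(y_under))
print("oversampled: ", Counter(y_over))
print("SMOTE-like:  ", Counter(y_smote))
```

In practice you would likely use the equivalents in imbalanced-learn (`RandomUnderSampler`, `RandomOverSampler`, and `SMOTE`, each with a `fit_resample` method) rather than hand-rolled code. Whichever you choose, resample only the training split so that duplicated or synthetic rows do not leak into the evaluation data.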

pirateofebay
  • @pirateofebay Thanks for your suggestion. Yes, I was planning on using SMOTE and oversampling of the minority class. However, these data aren't necessarily missing; they just contain lots of outliers (the zero values) across various input variables. That is what I'm asking about: what is the best approach to handle a situation like this when getting my data ready for modelling? Thanks – Omomaxi Aug 25 '21 at 16:36