I am trying to experiment with a credit card fraud detection dataset using Spark MLlib. The dataset has many more 0's (non-fraud) than 1's (fraud). To solve a class imbalance problem like this, is there any algorithm available in Spark, such as SMOTE? I am using logistic regression as the model.
- I did not try it, but I was searching for the answer to the same question as you. I found an implementation (not tested/validated) of SMOTE in Spark: https://gist.github.com/hhbyyh/346467373014943a7f20df208caeb19b. There is also a discussion about the same problem where the suggested solution is to use weights (https://stackoverflow.com/questions/33372838/dealing-with-unbalanced-datasets-in-spark-mllib), but in that example the classes are not as unbalanced as they would be in a fraud data set (a simple rebalancing sketch follows these comments). – waltersantosf Dec 22 '17 at 17:57
- @waltersantosf thanks a lot!! – Ayan Biswas Dec 23 '17 at 11:54
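Since SMOTE is not built into Spark MLlib, one simple fallback (not from the thread above) is stratified undersampling of the majority class with DataFrame.sampleBy. This is only a rough sketch: the DataFrame name df, the label column name, and the sampling fractions are assumptions for illustration.

# Hypothetical sketch: undersample the majority class (label 0) so the classes
# are roughly balanced. "df", the "label" column, and the 0.05 fraction are assumptions.
fractions = {0: 0.05, 1: 1.0}          # keep 5% of non-fraud rows, all fraud rows
balanced_df = df.sampleBy("label", fractions=fractions, seed=42)
balanced_df.groupBy("label").count().show()  # inspect the new class distribution

Undersampling throws away majority-class data, so weighting (as in the answer below) is often preferable when the dataset is not very large.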
1 Answer
You can try weightCol within logistic regression, something like this:
from pyspark.sql.functions import countDistinct
from pyspark.ml.classification import LogisticRegression

# Count rows per label and join the per-class counts back onto the training data
temp = train.groupBy("LabelCol").count()
new_train = train.join(temp, "LabelCol", how="leftouter")
num_labels = train.select(countDistinct("LabelCol")).first()[0]
# Weight each row inversely to its class frequency: total_rows / (num_labels * rows_in_class)
train1 = new_train.withColumn("weight", new_train.count() / (num_labels * new_train["count"]))

# Logistic Regression initiation
lr = LogisticRegression(weightCol="weight", family="multinomial")
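As a rough usage sketch of fitting and applying the weighted model: the column names "features" and "LabelCol" and the "test" DataFrame are illustrative assumptions, not part of the original answer.

# Hypothetical usage sketch; featuresCol/labelCol names and "test" are assumptions.
lr = LogisticRegression(featuresCol="features", labelCol="LabelCol",
                        weightCol="weight", family="multinomial")
model = lr.fit(train1)
predictions = model.transform(test)  # score a held-out DataFrame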


– Harsiddhi