I am trying to experiment with a credit card fraud detection dataset using Spark MLlib. The dataset has many more 0's (non-fraud) than 1's (fraud). To solve a class imbalance problem like this, is there any algorithm available in Spark, such as SMOTE? I am using logistic regression as the model.
- I did not try it, but I was searching for the answer to the same question as you. I found an implementation (not tested/validated) of SMOTE in Spark: https://gist.github.com/hhbyyh/346467373014943a7f20df208caeb19b. There is also a discussion about the same problem where the suggested solution is to use weights (https://stackoverflow.com/questions/33372838/dealing-with-unbalanced-datasets-in-spark-mllib), but in that example the classes are not as unbalanced as they would be in a fraud data set (a simple rebalancing sketch follows these comments). – waltersantosf Dec 22 '17 at 17:57
- @waltersantosf thanks a lot!! – Ayan Biswas Dec 23 '17 at 11:54
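Since SMOTE is not built into Spark MLlib, one simple fallback (not from the thread above) is stratified undersampling of the majority class with DataFrame.sampleBy. This is only a rough sketch: the DataFrame name df, the label column name, and the sampling fractions are assumptions for illustration.

# Hypothetical sketch: undersample the majority class (label 0) so the classes
# are roughly balanced. "df", the "label" column, and the 0.05 fraction are assumptions.
fractions = {0: 0.05, 1: 1.0}          # keep 5% of non-fraud rows, all fraud rows
balanced_df = df.sampleBy("label", fractions=fractions, seed=42)
balanced_df.groupBy("label").count().show()  # inspect the new class distribution

Undersampling throws away majority-class data, so weighting (as in the answer below) is often preferable when the dataset is not very large.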
1 Answer
You can try weightCol within logistic regression, something like this:
from pyspark.sql.functions import countDistinct
from pyspark.ml.classification import LogisticRegression

# Count rows per label and join the per-class counts back onto the training data
temp = train.groupBy("LabelCol").count()
new_train = train.join(temp, "LabelCol", how="leftouter")
num_labels = train.select(countDistinct("LabelCol")).first()[0]
# Weight each row inversely to its class frequency: total_rows / (num_labels * rows_in_class)
train1 = new_train.withColumn("weight", new_train.count() / (num_labels * new_train["count"]))

# Logistic Regression initiation
lr = LogisticRegression(weightCol="weight", family="multinomial")
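As a rough usage sketch of fitting and applying the weighted model: the column names "features" and "LabelCol" and the "test" DataFrame are illustrative assumptions, not part of the original answer.

# Hypothetical usage sketch; featuresCol/labelCol names and "test" are assumptions.
lr = LogisticRegression(featuresCol="features", labelCol="LabelCol",
                        weightCol="weight", family="multinomial")
model = lr.fit(train1)
predictions = model.transform(test)  # score a held-out DataFrame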


– Harsiddhi