
If I use undersampling in the case of an unbalanced binary target variable to train a model, the prediction method calculates probabilities under the assumption of a balanced data set. How can I convert these probabilities to actual probabilities for the unbalanced data? Is there a conversion argument/function implemented in the mlr package or another package? For example:

library(mlr)

# Simulated data: class "0" is the minority (~10%), class "1" the majority
a <- data.frame(y = factor(sample(0:1, prob = c(0.1, 0.9), replace = TRUE, size = 100)))
a$x <- as.numeric(a$y) + rnorm(n = 100, sd = 1)
task <- makeClassifTask(data = a, target = "y", positive = "0")
learner <- makeLearner("classif.binomial", predict.type = "prob")
# Keep only 10% of the majority class "1" during training
learner <- makeUndersampleWrapper(learner, usw.rate = 0.1, usw.cl = "1")
model <- train(learner, task, subset = 1:50)
pred <- predict(model, task, subset = 51:100)
head(pred$data)
asked by tover

1 Answer


A very simple yet powerful method has been proposed by [Dal Pozzolo et al., 2015].

Paper title: "Calibrating Probability with Undersampling for Unbalanced Classification" by Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi.

It is specifically designed to tackle the issue of calibration (i.e. transforming the predicted probabilities of your classifier into actual probabilities in the unbalanced case) when undersampling has been used.

You just have to correct your predicted probability p_s using the following formula:

   p = beta * p_s / ((beta-1) * p_s + 1)

where beta is the ratio of the number of majority-class instances after undersampling to the number of majority-class instances in the original training set (so beta = 0.1 if you keep 10% of the majority class).
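The correction can be sketched directly in R. This is a minimal illustration, not part of the mlr API: `calibrate_undersampled` is a made-up helper name, and beta = 0.1 assumes the undersampling kept 10% of the majority class (as `usw.rate = 0.1` does in the question's example).

```r
# Dal Pozzolo et al. correction: p = beta * p_s / ((beta - 1) * p_s + 1)
# p_s  : probability of the minority (positive) class from the undersampled model
# beta : majority-class count after undersampling / original majority-class count
calibrate_undersampled <- function(p_s, beta) {
  beta * p_s / ((beta - 1) * p_s + 1)
}

beta <- 0.1  # e.g. usw.rate = 0.1 keeps 10% of the majority class

calibrate_undersampled(0.5, beta)  # ~0.0909: a 50% score shrinks to ~9%
calibrate_undersampled(0, beta)    # 0 stays 0
calibrate_undersampled(1, beta)    # 1 stays 1
```

In the question's setup this would be applied to the positive-class column, e.g. `calibrate_undersampled(pred$data$prob.0, beta)`.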

Other methods

Other methods, which are not specifically focused on the undersampling bias, have also been proposed. The most popular ones are the following:

They are both implemented in R.

  • I also found a different formula: 1/(1+(1/original fraction-1)/(1/oversampled fraction-1)*(1/scoring result-1)); It is described here: http://www.data-mining-blog.com/tips-and-tutorials/overrepresentation-oversampling/ and also uses the "oversampled" fraction. The two formulas give somewhat different results. Does anyone have an idea which one is better/when to use which one? – tover Jul 25 '17 at 12:39
  • 1
    I haven't read thoroughly your article but it is about **oversampling** the minority class while Dal Pozzolo's formula is when you do **undersampling** on the majority class. So they do not apply in the same cases – Pop Jul 25 '17 at 12:42
  • In this article they mean undersampling when they say "oversampling". – tover Jul 25 '17 at 12:50
  • I have to correct myself: The results from both formulas are almost exactly identical (at least in the example I have used). However I'm still curious about what's the point of using the more complicated formula from data-mining.blog or if there may be larger differences for other cases. – tover Jul 25 '17 at 14:18
  • 1
    It's hard to tell because there is no derivation of this formula nor explanation in this blog... It may be worth a question in stats.stackexchange – Pop Jul 25 '17 at 14:41
  • I did some empirical testing and it seems that Pozzolo's formula only works for fully balanced data (i.e. both target categories have identical frequency in the downsampled data) – tover Jul 28 '17 at 14:54
  • Regarding `where beta is the ratio of the number of majority-class instances after undersampling to the number of majority-class instances in the original training set`: so if the majority class was the negative class, and this was reduced from `n` to `m`, then beta = m/n. But what if the majority class was the positive class? Would this change anything? Thanks – Chuck Apr 06 '20 at 12:32