
I have a dataframe with 1% positive classes (1's) and 99% negatives (0's), and I am working with Logistic Regression in PySpark. I read here that the way to deal with unbalanced datasets is to add a weightCol, as the answer provided in the link says, in order to tell the model to focus more on the 1's, since there are fewer of them.

I've tried it and it works well, but I don't know how MLlib balances the data internally. Does anyone have a clue? I don't like working with "black boxes" I can't comprehend.
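For context, the usual weighting scheme behind a weightCol is simple: each row gets a weight inversely proportional to its class frequency, so the rare 1's count for more in the loss. A minimal plain-Python sketch (the helper name and toy counts are illustrative, not part of Spark's API):

```python
def class_weights(labels):
    """Compute per-row weights that balance a binary label column.

    Positives get weight = fraction of negatives, negatives get
    weight = fraction of positives, so each class contributes the
    same total weight to the loss.
    """
    n = len(labels)
    n_pos = sum(labels)
    balancing_ratio = (n - n_pos) / n  # fraction of negatives
    return [balancing_ratio if y == 1 else 1.0 - balancing_ratio
            for y in labels]

# Toy 1%-positive dataset: 1 positive among 100 rows
labels = [1] + [0] * 99
w = class_weights(labels)
# the lone positive weighs 0.99; each negative weighs 0.01,
# so both classes sum to the same total weight (0.99)
```

In PySpark you would attach the result as a column (e.g. via `withColumn`) and pass its name as `weightCol` to `LogisticRegression`.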

Manrique

1 Answer


From the Spark documentation:

We implemented two algorithms to solve logistic regression: mini-batch gradient descent and L-BFGS. We recommend L-BFGS over mini-batch gradient descent for faster convergence.

You can check LBFGS.scala to see how the optimization algorithm updates the weights after each iteration.
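As for what the weightCol does inside that optimization: the "black box" is just a weighted log-loss, where each example's contribution to the loss and gradient is multiplied by its weight, so giving a row weight 2 is equivalent to duplicating it. A simplified plain-Python sketch of the gradient the optimizer aggregates each iteration (not Spark's actual Scala code):

```python
import math

def weighted_logloss_grad(X, y, w, beta):
    """Gradient of the weighted logistic loss sum_i w_i * logloss_i
    with respect to the coefficient vector beta."""
    grad = [0.0] * len(beta)
    for xi, yi, wi in zip(X, y, w):
        z = sum(b * x for b, x in zip(beta, xi))
        p = 1.0 / (1.0 + math.exp(-z))          # predicted probability
        for j, xj in enumerate(xi):
            grad[j] += wi * (p - yi) * xj       # weight scales the row's pull
    return grad

# Weighting a row by 2 gives the same gradient as duplicating it:
X = [[1.0, 0.5], [1.0, -0.3]]
y = [1, 0]
beta = [0.1, -0.2]
g_weighted = weighted_logloss_grad(X, y, [2.0, 1.0], beta)
g_duplicated = weighted_logloss_grad([X[0]] + X, [1] + y, [1.0, 1.0, 1.0], beta)
```

This is why up-weighting the 1's pulls the fitted coefficients toward classifying them correctly, exactly as oversampling would, but without materializing extra rows.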