
After using OpenCV for boosting, I'm trying to implement my own version of the AdaBoost algorithm (check here, here and the original paper for some references).

While reading all the material I've come up with some questions regarding the implementation of the algorithm.

1) It is not clear to me how the weight a_t of each weak learner is assigned.

In all the sources I've pointed to, the choice is a_t = k * ln( (1-e_t) / e_t ), with k a positive constant and e_t the error rate of the particular weak learner.

On page 7 of this source it says that this particular value minimizes a certain convex differentiable function, but I really don't understand the passage.

  • Can anyone please explain it to me?
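For concreteness, this is how I'm computing a_t in my own implementation at the moment (a minimal Python sketch; the function name is mine, and I'm assuming k = 1/2, the value used in most presentations):

    import numpy as np

    def learner_weight(e_t, k=0.5):
        """Weight a_t of a weak learner with weighted error rate e_t.

        k = 0.5 is the constant used in most presentations of AdaBoost;
        e_t is clipped to avoid division by zero / log of zero.
        """
        e_t = np.clip(e_t, 1e-10, 1 - 1e-10)
        return k * np.log((1 - e_t) / e_t)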

2) I have some doubts about the weight-update procedure for the training samples.

Clearly it should be done in such a way as to guarantee that the weights remain a probability distribution. All the references adopt this choice:

D_{t+1}(i) = D_{t}(i) * e^(-a_t * y_i * h_t(x_i)) / Z_t (where Z_t is a normalization factor chosen so that D_{t+1} is a distribution).

  • But why is the particular choice of weight update multiplicative with the exponential of error rate made by the particular weak learner?
  • Are there any other updates possible? And if yes is there a proof that this update guarantees some kind of optimality of the learning process?
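To make question 2 concrete, this is the update step as I currently understand it (a Python sketch, assuming labels y_i and predictions h_t(x_i) in {-1, +1}; all names are mine):

    import numpy as np

    def update_distribution(D, a_t, y, h):
        """D_{t+1}(i) = D_t(i) * exp(-a_t * y_i * h_t(x_i)) / Z_t.

        D : current sample weights, shape (n,), summing to 1
        y : true labels in {-1, +1}, shape (n,)
        h : weak-learner predictions in {-1, +1}, shape (n,)
        """
        D_next = D * np.exp(-a_t * y * h)
        Z_t = D_next.sum()      # normalization factor
        return D_next / Z_t     # D_{t+1} is again a probability distribution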

I hope this is the right place to post this question, if not please redirect me!
Thanks in advance for any help you can provide.

Matteo
  • Are you familiar with convex optimization at all ? If not, explaining this passage will take a while (a college course on optimization takes a semester) – AlexK May 20 '13 at 21:45
  • Yes, a different weight update scheme is possible if you choose a different objective function (look up "objective function" in a convex optimization book). For a different weight update scheme google "LogitBoost", for a guide on convex optimization see http://www.stanford.edu/~boyd/cvxbook/ – AlexK May 20 '13 at 21:52
  • @AlexK - I am familiar with convex optimization, but still I would need some explanations. Can you help me? – Matteo May 20 '13 at 22:05
  • Here is a presentation I did on AB a while ago: http://www.cs.ucf.edu/courses/cap6411/cap6411/spring2006/Lecture10.pdf On slide 4, see if you can take a derivative of Gab w.r.t alpha, equate it to 0 and derive the optimal alpha for AdaBoost update. That would be a good starting point for your first question. – AlexK May 20 '13 at 22:21
  • @AlexK - thanks for your tutorial! I studied it the last couple of days and it helped me understand some aspects, but not everything is clear to me... Any chance you can come up with a complete answer? – Matteo May 23 '13 at 17:22

1 Answer


1) Your first question:

a_t = k * ln( (1-e_t) / e_t )

The error on the training data is bounded by the product of the Z_t(alpha) factors, and Z_t(alpha) is convex w.r.t. alpha, so there is a single "global" optimal alpha which minimizes this upper bound on the error. This is the intuition behind how you find the magic "alpha".
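To make the passage concrete, here is a sketch of the standard derivation in the notation of the question (the usual argument, not necessarily word-for-word the one on page 7 of the source): split Z_t(a) over correctly and incorrectly classified points, differentiate, and set the derivative to zero.

    Z_t(a) = \sum_i D_t(i)\, e^{-a\, y_i h_t(x_i)}
           = (1 - e_t)\, e^{-a} + e_t\, e^{a}

    \frac{dZ_t}{da} = -(1 - e_t)\, e^{-a} + e_t\, e^{a} = 0
        \;\Longrightarrow\; e^{2a} = \frac{1 - e_t}{e_t}
        \;\Longrightarrow\; a_t = \frac{1}{2} \ln \frac{1 - e_t}{e_t}

The second derivative is positive everywhere, so this stationary point is the unique minimizer, and it is exactly the formula above with k = 1/2.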

2) Your 2nd question: But why is the particular choice of weight update multiplicative with the exponential of error rate made by the particular weak learner?

To cut it short: choosing alpha as above does indeed improve the accuracy. This is not surprising: you trust more (by giving a larger alpha) the learners that work better than the others, and trust less (by giving a smaller alpha) those that work worse. A learner that brings no new knowledge beyond the previous ones (error rate 1/2, i.e. no better than random guessing) gets weight alpha equal to 0.

It is possible to prove that the final boosted hypothesis has training error bounded by

exp( -2 * sum_t (1/2 - e_t)^2 )
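A sketch of where this bound comes from (the standard argument, writing g_t = 1/2 - e_t for the edge of the t-th learner): the training error is at most the product of the normalizers Z_t, and plugging the optimal a_t into Z_t gives

    \text{err}_{\text{train}} \le \prod_t Z_t,
    \qquad
    Z_t = 2\sqrt{e_t (1 - e_t)} = \sqrt{1 - 4 g_t^2} \le e^{-2 g_t^2}

    \Longrightarrow\quad \prod_t Z_t \le \exp\Big( -2 \sum_t (1/2 - e_t)^2 \Big)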

3) Your 3rd question: Are there any other updates possible? And if yes is there a proof that this update guarantees some kind of optimality of the learning process?

This is hard to say. Just remember that the update here improves the accuracy on the training data (at the risk of over-fitting); it is hard to say anything about how well it generalizes.
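Putting the two formulas together, a complete training loop looks roughly like this (a minimal Python sketch with brute-force decision stumps as weak learners; this is only an illustration, not the OP's or OpenCV's implementation):

    import numpy as np

    def fit_stump(X, y, D):
        """Brute-force decision stump: pick the (feature, threshold, polarity)
        with the lowest weighted error under the current distribution D."""
        best = (np.inf, 0, 0.0, 1)                 # (error, feature, threshold, polarity)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(X[:, j] <= thr, pol, -pol)
                    err = D[pred != y].sum()
                    if err < best[0]:
                        best = (err, j, thr, pol)
        return best

    def adaboost_train(X, y, T=20):
        """y must be in {-1, +1}. Returns a list of (a_t, feature, threshold, polarity)."""
        n = len(y)
        D = np.full(n, 1.0 / n)                    # uniform initial distribution
        ensemble = []
        for _ in range(T):
            e_t, j, thr, pol = fit_stump(X, y, D)
            e_t = np.clip(e_t, 1e-10, 1 - 1e-10)
            a_t = 0.5 * np.log((1 - e_t) / e_t)    # the weight from question 1
            pred = np.where(X[:, j] <= thr, pol, -pol)
            D = D * np.exp(-a_t * y * pred)        # the update from question 2
            D /= D.sum()                           # normalize by Z_t
            ensemble.append((a_t, j, thr, pol))
        return ensemble

    def adaboost_predict(ensemble, X):
        score = np.zeros(len(X))
        for a_t, j, thr, pol in ensemble:
            score += a_t * np.where(X[:, j] <= thr, pol, -pol)
        return np.sign(score)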

dragonxlwang