
I use the following code to update my Beta distribution on each trial and return an arm recommendation (I use scipy.stats.beta):

from scipy.stats import beta

self.prior = (1.0, 1.0)

def get_recommendation(self):
    sampled_theta = []
    for i in range(self.arms):
        # Construct the Beta posterior for arm i from its success/failure counts
        dist = beta(self.prior[0] + self.successes[i],
                    self.prior[1] + self.trials[i] - self.successes[i])
        # Draw one sample from the posterior
        sampled_theta += [dist.rvs()]
    # Return the index of the arm with the largest sampled value
    return sampled_theta.index(max(sampled_theta))

But currently it only works if the rewards are binary (either success or failure). I want to modify it so it works for non-binary rewards (e.g. rewards of 2300, 2000, ...). How do I do that?
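One common workaround, if the rewards are bounded and the maximum possible reward is known, is the rescaling trick: map each observed reward into [0, 1], flip a Bernoulli coin with that probability, and update the Beta posterior with the coin flip as if it were the binary outcome. This is a hedged sketch only; the class name `RescaledThompson`, the `max_reward` parameter, and the `update` method are illustrative and not part of the original code.

```python
import random

from scipy.stats import beta


class RescaledThompson:
    """Thompson sampling with Beta posteriors over rescaled rewards (sketch)."""

    def __init__(self, arms, max_reward):
        self.arms = arms
        self.max_reward = float(max_reward)  # assumed known upper bound on rewards
        self.prior = (1.0, 1.0)              # uniform Beta(1, 1) prior
        self.successes = [0] * arms
        self.trials = [0] * arms

    def get_recommendation(self):
        # Same posterior sampling as the binary version
        sampled_theta = [
            beta(self.prior[0] + self.successes[i],
                 self.prior[1] + self.trials[i] - self.successes[i]).rvs()
            for i in range(self.arms)
        ]
        return sampled_theta.index(max(sampled_theta))

    def update(self, arm, reward):
        # Rescale the reward into [0, 1], then treat a Bernoulli(p) coin
        # flip as the binary success/failure observation.
        p = reward / self.max_reward
        self.trials[arm] += 1
        if random.random() < p:
            self.successes[arm] += 1
```

The coin flip preserves the expected reward of each arm (E[flip] = reward / max_reward), so the Beta posterior still concentrates on the arm with the highest mean reward.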

Wise
    You can't use a beta-binomial model for continuous rewards. The likelihood function is binomial, which is a discrete random variable representing counts (hence the success or failure). You'll need a new model - consider using a normal distribution. – ilanman Nov 12 '16 at 02:55
  • How about using the expected reward as the probability? Normalized, of course, so that each binomial probability is below 1.0. E.g. arm a has a probability of 0.01% and reward 2300, so the expected reward would be 0.23. – sroecker Jan 24 '17 at 15:27
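The normal-likelihood model suggested in the first comment can be sketched as follows: keep a Gaussian posterior over each arm's mean reward (a known-observation-variance model for simplicity) and sample from it instead of a Beta. The class name `GaussianThompson` and all parameter names here are illustrative assumptions, not anything from the question.

```python
import numpy as np


class GaussianThompson:
    """Thompson sampling with a conjugate normal posterior per arm (sketch)."""

    def __init__(self, arms, prior_mean=0.0, prior_var=1e6, reward_var=1.0):
        self.arms = arms
        self.reward_var = reward_var          # assumed known observation noise
        self.post_mean = [prior_mean] * arms  # posterior mean per arm
        self.post_var = [prior_var] * arms    # posterior variance per arm

    def get_recommendation(self):
        # Sample one value from each arm's posterior and pick the best
        samples = [np.random.normal(m, np.sqrt(v))
                   for m, v in zip(self.post_mean, self.post_var)]
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # Standard conjugate normal update with known reward variance:
        # precisions add, and the new mean is a precision-weighted average.
        v, m = self.post_var[arm], self.post_mean[arm]
        new_var = 1.0 / (1.0 / v + 1.0 / self.reward_var)
        new_mean = new_var * (m / v + reward / self.reward_var)
        self.post_var[arm], self.post_mean[arm] = new_var, new_mean
```

With a wide prior, the posterior mean quickly approaches the empirical mean of the observed rewards (e.g. around 2150 after seeing 2300 and 2000), and the posterior variance shrinks with each observation, so exploration naturally decays. If the reward variance is unknown too, a normal-gamma prior is the usual conjugate choice.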

0 Answers