Vowpal Wabbit Contextual Bandit correct usage

Question

I am currently using the Vowpal Wabbit package in order to simulate a Contextual Bandit. I had a couple of questions regarding the usage of the library:

I have multiple contexts/categories that share the same set of actions. For example, let's say I have jerseys of Team A, Team B and Team C. These jerseys come in sizes S, M and L. Based on past demand, I want to recommend a size of jersey to produce.

Contexts: Team A, Team B, Team C

Actions: S, M and L

Each context has the same set of actions to choose from. I want Vowpal Wabbit to understand that each context is different, and to learn a separate distribution over the action space for each one. Instead, Vowpal Wabbit appears to use the same distribution/pmf for the actions across all contexts.

So if Team A is the context, the distribution is [0.1, 0.8, 0.1] after several runs. Team B then gets the same distribution [0.1, 0.8, 0.1], even though VW has never seen Team B as an input. Ideally, I would want it to start from [0.33, 0.33, 0.33].

Is there a way I can utilize VW to differentiate contexts and give them separate distributions?

I am simulating the Contextual Bandit in Vowpal Wabbit with the following settings: "--cb_explore_adf --save_resume --quiet --epsilon 0.1"

I was also wondering if there was a way to access/view the underlying learnt policy? Where are the different distributions or learnt policies stored?

Thanks

Please provide enough code so others can better understand or reproduce the problem. — Community, Sep 14 '22 at 10:13

score 1 · Answer 1 · answered Sep 29 '22 at 18:27

For VW to understand that each context is different, you need to add "-q CA" to create feature interactions between the context features and the action features. Note that "-q CA" pairs the namespaces whose names start with C and A, so the context and action features have to live in namespaces named accordingly. Since you already trained the model with Team A, the model weights have already been updated by the time you train for Team B, so the distribution won't be uniform random anymore. Maybe you can try adding --ignore_linear C and --ignore_linear A? Also, I'm curious why you would want the action distribution to be uniform random for Team B.
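As a rough illustration of the namespace layout that "-q CA" relies on, here is a minimal sketch that formats one cb_explore_adf example as text. The namespace names Context and Action, the feature names team=... and size=..., and the helper to_adf_example are assumptions made for this example, not something taken from the question. The idea is that shared context features on their own add the same amount to every action's score, so only the Context x Action interaction lets the ranking of sizes differ per team.

```python
# Sketch only: builds the multi-line text form of a cb_explore_adf example.
# Namespace names ("Context", "Action") and feature names are assumptions;
# "-q CA" would match them by their first letters, C and A.

SIZES = ["S", "M", "L"]


def to_adf_example(team, chosen=None, cost=None, prob=None):
    """Return one ADF example: a shared context line plus one line per action.

    If `chosen` is given, that action line carries the label
    "0:<cost>:<probability>" describing the logged decision.
    """
    lines = [f"shared |Context team={team}"]
    for size in SIZES:
        label = f"0:{cost}:{prob} " if size == chosen else ""
        lines.append(f"{label}|Action size={size}")
    return "\n".join(lines)


# A labeled example for training: Team A, size M was chosen with
# probability 0.8 and observed cost -1.0 (i.e. a reward of 1).
example = to_adf_example("A", chosen="M", cost=-1.0, prob=0.8)
print(example)

# An unlabeled example for prediction on a team not seen before.
print(to_adf_example("B"))
```

With this layout, the weights learned for Team A sit on interaction features such as team=A combined with size=M, which a Team B example does not share. The linear size=... features are still shared across teams, which is what the --ignore_linear suggestion is aimed at.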

To access/view the learnt policy you can try "--readable_model READABLE_MODEL_PATH". To save the different distributions you can use "-p PREDICTION_FILE_PATH", and to save the learnt policy, "-f MODEL_PATH". For more options about the learnt policy: https://vowpalwabbit.org/docs/vowpal_wabbit/python/latest/command_line_args.html#output-model-options
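To show how these output options might be combined with the settings from the question, here is a small sketch that assembles the argument string. The helper build_vw_args and the file names are hypothetical placeholders; the string would then be passed to VW however you currently pass your settings.

```python
# Sketch only: assembles a VW argument string from the flags mentioned above.
# The function name and all file paths are hypothetical placeholders.

def build_vw_args(model_path, readable_path, predictions_path=None):
    args = [
        "--cb_explore_adf",                  # from the question's settings
        "--epsilon", "0.1",                  # from the question's settings
        "-q", "CA",                          # context x action interactions
        "-f", model_path,                    # save the learnt policy (binary)
        "--readable_model", readable_path,   # human-readable weight dump
    ]
    if predictions_path is not None:
        args += ["-p", predictions_path]     # write predicted distributions
    return " ".join(args)


vw_args = build_vw_args("cb.model", "cb.readable.txt", "cb.predictions.txt")
print(vw_args)
```

The readable model lists weights by hashed feature index, so it can be hard to map entries back to teams and sizes by eye.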

I want my model to differentiate between contexts and provide a unique distribution for each one. If the model is trained with Team A, then when a new context (Team B) arrives, I would want it to have its own distribution, which starts out uniform. I do not want the distribution for Team B to remain uniform throughout. I have tried using "-q CA" and didn't see any changes in the distribution. Thanks for your comment about the readable model, I will give that a try! — theamar961, Oct 05 '22 at 18:51
