I am currently using the Vowpal Wabbit package to simulate a contextual bandit. I have a couple of questions about using the library:
- I have multiple contexts/categories that share the same set of actions. For example, let's say I have jerseys for Team A, Team B and Team C. These jerseys come in sizes S, M and L. Based on past demand, I want to recommend which jersey size to produce.
Contexts - Team A, Team B, Team C
Actions - S, M and L
Each context has the same set of actions to choose from. I want Vowpal Wabbit to treat each context as distinct and learn a separate distribution over the action space for each one. Instead, Vowpal Wabbit appears to be using the same distribution/pmf over the actions across all contexts.
For example, if Team A is the context, the distribution is [0.1, 0.8, 0.1] after several runs. Team B then gets the same distribution [0.1, 0.8, 0.1], even though VW has never seen Team B as an input. Ideally, I would want Team B to start from [0.33, 0.33, 0.33].
Is there a way I can utilize VW to differentiate contexts and give them separate distributions?
I am simulating the contextual bandit in Vowpal Wabbit with the following settings: "--cb_explore_adf --save_resume --quiet --epsilon 0.1"
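For concreteness, here is a minimal sketch of how one round of this setup could be written in VW's multi-line text format for --cb_explore_adf. The namespace names (Team, Size), the feature names, and the cost and probability values are illustrative assumptions; the helper to_vw_adf is hypothetical and only builds strings, it does not call VW.

```python
# Sketch: build one multi-line ADF example for the jersey setup.
# Namespace/feature names and the label values are assumptions for illustration.
SIZES = ["S", "M", "L"]


def to_vw_adf(team, chosen=None, cost=None, prob=None):
    """Return one cb_explore_adf example as VW text.

    The first line carries the shared (context) features; each following
    line describes one action. Only the action that was actually taken
    gets a label of the form action:cost:probability.
    """
    lines = [f"shared |Team team={team}"]
    for size in SIZES:
        label = ""
        if size == chosen:
            # The action index in an ADF label is conventionally 0;
            # VW minimizes cost, so a reward of 1 is written as cost -1.0.
            label = f"0:{cost}:{prob} "
        lines.append(f"{label}|Size size={size}")
    return "\n".join(lines)


# One observed round: context Team A, size M was chosen with probability 0.8.
example = to_vw_adf("A", chosen="M", cost=-1.0, prob=0.8)
print(example)
```

Calling to_vw_adf("B") without a label gives the unlabeled form that would be passed in when asking for the action distribution for Team B.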
- I was also wondering whether there is a way to access/view the underlying learnt policy. Where are the different distributions or learnt policies stored?
Thanks