I am using Vowpal Wabbit to train a contextual bandit policy. My use case is email marketing: learning which email variants perform better. The reward signal will be very sparse, since only about 1% of the emails are clicked (a click is the reward). How can I handle this heavy imbalance when training in Vowpal Wabbit?
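For reference, this is roughly how I am formatting the training examples in the `--cb_explore_adf` input format, with a click encoded as cost -1 and no click as cost 0 on the chosen action (the variant names, features, and logged probabilities below are simplified placeholders, not my real data):

```
shared |User age_bucket=25_34 region=east
0:-1:0.33 |Variant id=A subject=short_promo
|Variant id=B subject=long_newsletter
|Variant id=C subject=discount_code

shared |User age_bucket=45_54 region=west
|Variant id=A subject=short_promo
0:0:0.33 |Variant id=B subject=long_newsletter
|Variant id=C subject=discount_code
```

Because clicks are so rare, almost every labelled action ends up with cost 0 like the second example above, and only about 1 in 100 looks like the first.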
With only 1% of the observations carrying a reward (i.e., a negative cost), the model does not appear to learn anything even after training for long durations. What options in Vowpal Wabbit can help address this? I am looking for the syntax of such options (to handle imbalance during training) and example usages in Vowpal Wabbit, but I couldn't find any.
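This is the kind of command I have been running so far, as a minimal sketch (the epsilon value and file names are just what I happen to use, not a recommendation):

```sh
# epsilon-greedy exploration over action-dependent features;
# emails_cb.dat holds examples in the format shown above
vw --cb_explore_adf --epsilon 0.1 -d emails_cb.dat -f email_cb.model
```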