Vowpal Wabbit Contextual Bandit Data Formatting

Question

I have 2 questions about formatting data for contextual bandit model training.

Suppose I have data such as the following:

1:1:0.2 | d1:us d2:female d3:12

Question 1) I read in the VW Wiki that each feature is optionally followed by a float. When a feature's value is categorical (such as us or female), what is the best way to re-format it? I am thinking that I would simply not suffix such features with a float and let them take the default value of 1. I'm hoping this would achieve one-hot encoding.

Question 2) I've been training the model incorrectly by logging the data as below:

1:1:0.2 | us female 12

What I now realize is that "us", "female", and "12" are each treated as a separate feature with a default value of 1. Am I correct?

score 3 · Accepted Answer · answered Jan 17 '17 at 01:41

Yes, you're correct.

The input feature format is: space-separated features, each written as <name>:<value>, where :<value> is optional and, if present, must be numeric. A feature with no :<value> gets the default value 1.

To represent categorical values you could use something other than : as the separator between <name> and <value>, for example d1=us. In this case the whole string is treated as the feature name, with the default value 1. This is often called "one-hot encoding" (each possible feature+value combo is treated as a separate feature).
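As a minimal sketch of the separator trick above, the following Python helper builds one contextual-bandit line from a record. The function names (format_feature, to_cb_line) and the choice of = as the separator are my own, not part of vw; it also assumes d3 is meant to be a genuinely numeric feature and that categorical values contain no spaces, colons, or pipes.

```python
def format_feature(name, value):
    """Format one feature token for a vw input line."""
    # Numeric values keep the name:value form, so vw reads the value as a float.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return f"{name}:{value}"
    # Categorical values are fused into the feature name with '=' (an arbitrary
    # choice; any character other than ':', '|' or whitespace would do).
    # The resulting feature gets the implicit default value of 1.
    return f"{name}={value}"


def to_cb_line(action, cost, probability, features):
    """Build a line of the form action:cost:probability | features."""
    tokens = " ".join(format_feature(n, v) for n, v in features.items())
    return f"{action}:{cost}:{probability} | {tokens}"


line = to_cb_line(1, 1, 0.2, {"d1": "us", "d2": "female", "d3": 12})
print(line)  # 1:1:0.2 | d1=us d2=female d3:12
```

With this layout, d1=us and d1=uk become two unrelated features, which is the one-hot behaviour you were after.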

Also note that vw does not really hash the feature name 12: it maps it directly to slot 12 (modulo 2^bits) in the hash table, on the assumption that this is what the user wants, since numeric feature names are common (and are the libSVM convention). This can be disabled with the option --hash all on the command line. The default is --hash strings, meaning: apply the (murmur3) hash to feature names that look like strings (not integers), but leave alone (don't hash) feature names that look like integers.
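To make the slot rule concrete, here is a rough Python illustration of the integer branch only. It is not vw's code: integer_name_slot is a hypothetical helper, bits=18 assumes the default table size, and it ignores any namespace offset. The murmur3 branch used for string-like names is deliberately not reproduced.

```python
def integer_name_slot(name, bits=18):
    """Return the slot for an integer-looking feature name, else None.

    Illustrative only: models the default '--hash strings' behaviour for
    names made up entirely of ASCII digits. String-like names would go
    through murmur3 in vw, which this sketch does not attempt to mimic.
    """
    if name.isascii() and name.isdigit():
        # The integer is used as-is and wrapped to the table size (2^bits).
        return int(name) % (1 << bits)
    return None


print(integer_name_slot("12"))  # 12
print(integer_name_slot("us"))  # None
```

So in the line from Question 2, the token 12 lands in a fixed slot rather than a hashed one, which is another reason to give it a proper name such as d3:12.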

See also: "How to represent categorical features in vowpal-wabbit", which includes a cheat-sheet for representing input features in vw.
