
I have the following data with all categorical variables:

    class  education  income  social_standing
    1      basic      low     good
    0      low        high    v_good
    1      high       low     not_good
    0      v_high     high    good

Here, education has four levels (basic, low, high and v_high); income has two levels (low and high); and social_standing has three levels (good, v_good and not_good).

As far as I understand the conversion of the above data to VW format, it would be something like this:

    1 |person education_basic income_low social_standing_good
    0 |person education_low income_high social_standing_v_good
    1 |person education_high income_low social_standing_not_good
    0 |person education_v_high income_high social_standing_good

Here, 'person' is the namespace and all the others are feature values, prefixed by their respective feature names. Am I correct? Somehow this representation of feature values is quite perplexing to me. Is there another way to represent features? I shall be grateful for help.

Ashok K Harnal

1 Answer


Yes, you are correct.

This representation will definitely work with vowpal wabbit but, depending on the data, it may not be optimal.

To represent non-ordered, categorical variables (with discrete values), the standard vowpal wabbit trick is to use logical/boolean values for each possible (name, value) combination (e.g. person_is_good, color_blue, color_red). The reason this works is that vw implicitly assumes a value of 1 wherever a value is missing. There's no practical difference between color_red, color=red, color_is_red, or even (color,red) and color_red:1, except hash locations in memory. The only characters you cannot use in a variable name are the special separators (: and |) and white-space.

Terminology note: this trick of converting each (feature + value) pair into a separate feature is sometimes called "One Hot Encoding".
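
For illustration, here's a minimal Python sketch of this one-hot conversion for a single data row (the helper name to_vw_onehot and its interface are mine, not part of vw):

    # Minimal one-hot encoder for one data row (hypothetical helper, not part of vw).
    def to_vw_onehot(label, row, namespace="person"):
        # Append each value to its variable name; vw implies :1 for every
        # feature that has no explicit numeric value.
        feats = " ".join(f"{name}_{value}" for name, value in row.items())
        return f"{label} |{namespace} {feats}"

    print(to_vw_onehot(1, {"education": "basic", "income": "low",
                           "social_standing": "good"}))
    # -> 1 |person education_basic income_low social_standing_good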

But in this case the variable values may not be "strictly categorical". They may be:

  • Strictly ordered, e.g. (low < basic < high < v_high)
  • Presumably have a monotonic relation with the label you're trying to predict

so by making them "strict categorical" (my term for a variable with a discrete range which doesn't have the two properties above) you may be losing some information that may help learning.

In your particular case, you may get better results by converting the values to numeric, e.g. (1, 2, 3, 4) for education; i.e. you could use something like:

    1 |person education:2 income:1 social_standing:2
    0 |person education:1 income:2 social_standing:3
    1 |person education:3 income:1 social_standing:1
    0 |person education:4 income:2 social_standing:2
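
A hedged Python sketch of how those numeric lines could be generated (the value-to-rank maps below are my reading of the ordering low < basic < high < v_high and of the example lines above):

    # Assumed ordinal maps (adjust the orderings to your domain knowledge).
    EDUCATION = {"low": 1, "basic": 2, "high": 3, "v_high": 4}
    INCOME = {"low": 1, "high": 2}
    SOCIAL = {"not_good": 1, "good": 2, "v_good": 3}

    def to_vw_numeric(label, education, income, social_standing):
        return (f"{label} |person education:{EDUCATION[education]} "
                f"income:{INCOME[income]} "
                f"social_standing:{SOCIAL[social_standing]}")

    print(to_vw_numeric(1, "basic", "low", "good"))
    # -> 1 |person education:2 income:1 social_standing:2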

The training set in the question should still work fine: even when you convert all your discrete variables into boolean variables as you did, vw should discover both the ordering and the monotonicity with the label from the data itself, as long as the two properties above hold and there's enough data to deduce them.

Here's the short cheat-sheet for encoding variables in vowpal wabbit:

    Variable type       How to encode                Readable example
    -------------       -------------                ----------------
    boolean             only encode the true case    is_alive
    categorical         append value to name         color=green
    ordinal+monotonic   :approx_value                education:2
    numeric             :actual_value                height:1.85
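
For instance, a single input line mixing all four row types from the cheat-sheet could look like this (the feature names are illustrative only):

    1 |person is_alive color=green education:2 height:1.85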

Final notes:

  • In vw all variables are numeric. The encoding tricks are just practical ways to make things appear as categorical or boolean. Boolean variables are simply numeric 0 or 1; categorical variables can be encoded as boolean: name+value:1.
  • Any variable whose value is not monotonic with the label may be less useful when numerically encoded.
  • Any variable that is not linearly related to the label may benefit from a non-linear transformation before training.
  • Any variable with a zero value will not make a difference to the model (exception: when the --initial_weight <value> option is used), so it can be dropped from the training set.
  • When parsing a feature, only : is considered a special separator (between the variable name and its numeric value); anything else is considered part of the name, and the whole name string is hashed to a location in memory. A missing :<value> part implies :1.
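
To make the last two notes concrete, the following two input lines are treated identically by vw as far as the learned model is concerned: the missing :1 is implied for color_red, and height:0 contributes nothing (feature names are illustrative):

    1 |f color_red height:0
    1 |f color_red:1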

Edit: what about name-spaces?

Name-spaces are prepended to feature names with a special-char separator, so the same feature name in different name-spaces maps to different hash locations. Example:

    |E low |I low

is essentially equivalent to this flat (no name-spaces) example:

    |  E^low:1 I^low:1

The main uses of name-spaces are to easily redefine all members of a name-space as something else, ignore a full name-space of features, cross the features of one name-space with another, etc. (see the -q, --cubic, --redefine, --ignore and --keep options).
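
For example, assuming a training file named data.vw (a hypothetical name) whose lines use the E and I name-spaces above, the two name-spaces can be crossed from the command line:

    # Cross every feature in name-space E with every feature in name-space I
    # (-q refers to name-spaces by their first character):
    vw -q EI data.vw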

arielf
  • Thanks. Very, very clear. Except that by boolean (first row of the cheat-sheet), am I to understand logical or binary? That is, if it is binary (with any two values, not necessarily in the sense of True/False), I will not have to prefix the column name but just write the value as it is, while in the other three cases column names will have to precede values in some form. – Ashok K Harnal Feb 21 '15 at 10:35
  • 1st line 'boolean' means logical: either true or false (0 or 1). 2nd line: categorical may assume any number of discrete values, including 2. – arielf Feb 21 '15 at 19:57
  • Thanks. All these clarifications make using Vowpal Wabbit much easier. – Ashok K Harnal Feb 22 '15 at 00:58
  • @arielf Thank you for the exhaustive answer! The only question: do I need to explicitly specify prefixes for categorical features in _different_ namespaces? That is, will this stuff: `|E low |I low` be treated by vw as different features or not? From [input format doc](https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format) I get an idea that it will. – kurtosis Oct 23 '15 at 09:30
  • Great question I too share! – B_Miner Jan 14 '16 at 01:58
  • @kurtosis - yes, different name spaces will map the same feature-names into different locations - i.e. separate features. I will add this to the answer. – arielf Jan 15 '16 at 03:01
  • I think I get this part, but not 100% sure: "Any variable that is not linearly related to the label may benefit from a non-linear transformation before training." – matanster Apr 16 '18 at 15:00
  • @matanster: at its core (default options) `vw` is a linear learner, so it does linear regression. If you have some input `x` where (for example) `log(x)` is better correlated with the `y` label than a straight `x`, applying the `log(x)` transformation before feeding the input into `vw` should produce a better model. `vw` doesn't do math transforms automatically. You can get some non-linearity with non-linear options like `--link logistic`, `--nn`, or `--quadratic XY`, but these don't work directly on the individual inputs; they work on the "link" function that combines the inputs. – arielf Apr 16 '18 at 18:34
  • Thanks, this is what I thought (of course, a-priori guessing how x and y correlate is kind of an intuition or something known from past data). I am still unsure why does it treat a value of zero the way it does other than when the feature is a 0/1 feature, and what range exactly are numeric feature values allowed to be. I'd like to assume 0/1 semantics only apply to categorical features whereas numerical ones are allowed any range on the real numbers axis. Maybe I'd edit the answer to make this note. – matanster Apr 16 '18 at 19:37
  • In `vw` models start with 0.0 weight assumed everywhere. In order to learn, any new info should pull-or-push weights away from 0.0. That's why logistic regression (2-way classification) in `vw` needs labels to be {-1, 1} rather than {0, 1}. There are no categorical values in `vw`; all labels and feature values are numeric (floating point) internally, and all the update computation is numeric. The above cheat-sheet only suggests practical ways to make it _appear_ as if values are categorical/boolean etc. Internally, they aren't. – arielf Apr 16 '18 at 20:16
  • This information should be added to the VW format page of the VowpalWabbit wiki pages. Happy to do it myself if you agree @arielf. – Pawelek Jan 23 '20 at 17:45