There are two separate questions here:

Q1: Why can't you (and shouldn't you) use transformations like tf-idf when using vowpal wabbit?
A1: vowpal wabbit is not a batch learning system, it is an online-learning system. In order to compute measures like tf-idf (term frequency in each document vs the whole corpus) you need to see all the data (corpus) first, and sometimes do multiple passes over the data. vowpal wabbit, as an online/incremental learning system, is designed to also work on problems where you don't have the full data ahead of time. See this answer for a lot more details.
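To see why the whole corpus matters, here is a tiny Python sketch (illustrative only, not anything vw itself does) of the idf part of tf-idf; the document frequencies and the document count are global statistics that are only known after every document has been seen:

import math

# A toy corpus: each document is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "fast"],
]

# Document frequency: in how many documents does each term appear?
df = {}
for doc in corpus:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

n_docs = len(corpus)

# idf needs n_docs and df, i.e. a full pass over the corpus,
# which is exactly what a pure online learner cannot assume.
idf = {term: math.log(n_docs / count) for term, count in df.items()}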
Q2: How does vowpal wabbit "transform" the features it sees?
A2: It doesn't. It simply maps each word feature on the fly to its hashed location in memory. The learning step is driven by a repetitive optimization loop (SGD by default, or optionally BFGS), example by example, to minimize the modeling error. You may select the loss function to optimize for.
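For intuition only, here is a rough Python sketch of that hashing idea. It is not vw's actual hash (vw uses murmurhash over a table of 2^b slots, where b is the -b option), but it shows how a feature name becomes a table index immediately, without building any vocabulary first:

import hashlib

NUM_BITS = 18                  # illustrative table size of 2**b slots
TABLE_SIZE = 1 << NUM_BITS

def feature_slot(name: str) -> int:
    # Hash the feature name and fold it into the weight table.
    # md5 here is only for illustration; vw uses murmurhash.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % TABLE_SIZE

# Each word is mapped to a slot on the fly, with no corpus-wide pass.
print(feature_slot("cat"), feature_slot("dog"))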
However, if you already have the full data you want to train on, nothing prevents you from transforming it (using any other tool) before feeding the transformed values to vowpal wabbit. It's your choice. Depending on the particular data, you may get better or worse results using a transformation pre-pass than by running multiple passes with vowpal wabbit itself without preliminary transformations (check out the vw --passes option).
To complete the answer, let's add another related question:
Q3: Can I use pre-transformed (e.g. tf-idf) data with vowpal wabbit?
A3: Yes, you can. Just use the following (post-transformation) form: instead of words, use integers as feature IDs, and since any feature can have an optional explicit weight, use the tf-idf floating point values as weights, following the : separator, in typical SVMlight format:
-1 | 1:0.534 15:0.123 3:0.27 29:0.066 ...
1 | 3:0.1 102:0.004 24:0.0304 ...
The reason this works is that vw has a nice feature of distinguishing between string and integer features. It doesn't hash feature names that look like integers (unless you use the --hash all option explicitly). Integer feature numbers are used directly, as if they were the hash result of the feature.
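If your data starts out as raw text, a small conversion pass can produce lines in that form. The following is only a sketch under my own assumptions (scikit-learn as the tf-idf tool, made-up file names), not something vowpal wabbit requires:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]
labels = [-1, 1]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse (n_docs x n_terms) tf-idf matrix

# Write one vw example per document: label | id:weight id:weight ...
with open("train_tfidf.vw", "w") as out:
    for i, label in enumerate(labels):
        row = matrix.getrow(i)
        # 1-based feature IDs purely by convention in this sketch
        pairs = " ".join(f"{j + 1}:{v:.6f}" for j, v in zip(row.indices, row.data))
        out.write(f"{label} | {pairs}\n")

Whether such a pre-transformation helps compared to letting vw run multiple passes over the raw words is data-dependent, as noted above.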