
In sklearn, when we pass sentences to an algorithm, we can use text feature extractors like CountVectorizer, TfidfVectorizer, etc., and we get back an array of floats.
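
For example, a minimal sketch (the toy corpus and variable names are just for illustration; get_feature_names_out assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The sun is blue", "The sun is yellow"]  # toy corpus
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)           # sparse matrix of tf-idf floats
print(vec.get_feature_names_out())      # learned vocabulary
print(X.toarray())                      # dense array of floats fed to the estimator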

But what do we get when we pass vowpal wabbit an input file like this one:

-1 |Words The sun is blue
1 |Words The sun is yellow

What is used in the internal implementation of vowpal wabbit? How is this text transformed?

Andrei

1 Answer


There are two separate questions here:

Q1: Why can't (and shouldn't) you use transformations like tf-idf when using vowpal wabbit?

A1: vowpal wabbit is not a batch learning system; it is an online-learning system. In order to compute measures like tf-idf (term frequency in each document vs. the whole corpus) you need to see all the data (the corpus) first, and sometimes make multiple passes over it. vowpal wabbit, as an online/incremental learning system, is designed to also work on problems where you don't have the full data ahead of time. See this answer for a lot more details.
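
To make that dependency concrete, one common textbook form of the weight is tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t. Both N and df(t) are only known after scanning the whole corpus, which is exactly what an online learner does not assume.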

Q2: How does vowpal wabbit "transform" the features it sees?

A2: It doesn't. It simply maps each word feature on the fly to its hashed location in memory. The online learning step is driven by a repetitive optimization loop (SGD or L-BFGS), example by example, to minimize the modeling error. You may select the loss function to optimize for.
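
A rough Python sketch of the hashing idea (illustration only: vw actually uses MurmurHash3, folds the namespace into the hash its own way, and by default uses an 18-bit weight table, i.e. -b 18):

import hashlib

BITS = 18                      # vw's default -b 18 => 2**18 weight slots
MASK = (1 << BITS) - 1

def feature_slot(word, namespace="Words"):
    # deterministic stand-in hash; not vw's real hash function
    digest = hashlib.md5(f"{namespace}^{word}".encode()).digest()
    return int.from_bytes(digest[:8], "little") & MASK

for w in "The sun is yellow".split():
    print(w, "->", feature_slot(w))  # each word lands in a weight slot; learning updates that slot's weight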

However, if you already have the full data you want to train on, nothing prevents you from transforming it (using any other tool) before feeding the transformed values to vowpal wabbit. It's your choice. Depending on the particular data, you may get better or worse results using a transformation pre-pass than by running multiple passes with vowpal wabbit itself without preliminary transformations (check out the vw --passes option).

To complete the answer, let's add another related question:

Q3: Can I use pre-transformed (e.g. tf-idf) data with vowpal wabbit?

A3: Yes, you can. Just use the following (post-transformation) form. Instead of words, use integers as feature IDs, and since any feature can have an optional explicit weight, use the tf-idf floating-point values as the weights, following the : separator as in the typical SVMlight format:

-1 |  1:0.534  15:0.123  3:0.27  29:0.066  ...
1  |  3:0.1  102:0.004  24:0.0304  ...
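
For instance, a minimal Python sketch that writes such a file from sklearn's tf-idf output (the file name, labels, and the to_vw_line helper are just for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The sun is blue", "The sun is yellow"]   # toy data
y = [-1, 1]                                       # labels
X = TfidfVectorizer().fit_transform(docs)         # sparse tf-idf matrix (CSR)

def to_vw_line(label, row):
    # 1-based integer feature IDs; each feature carries its tf-idf value as weight
    feats = " ".join(f"{j + 1}:{v:.4g}" for j, v in zip(row.indices, row.data))
    return f"{label} | {feats}"

with open("train.vw", "w") as f:
    for i, label in enumerate(y):
        f.write(to_vw_line(label, X.getrow(i)) + "\n")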

The reason this works is that vw has a nice feature of distinguishing between string and integer features. It doesn't hash feature names that look like integers (unless you pass --hash all explicitly). Integer feature numbers are used directly, as if they were the hash result of the feature.

arielf
  • Awesome answer! Thanks a lot! But I have one more question. If I want to learn a binary sigmoidal feedforward network, is it necessary to use the --passes arg? What is it used for? And how do I test the predicted model? – Andrei Mar 30 '17 at 21:28
  • You may always use `--passes` with `vw` as if it were a batch setting where all data is known in advance. It is your choice. Be aware, though, that multiple passes may lead to over-fitting. Also check out the `--holdout_period` and `--bootstrap` options to help in avoiding over-fitting. For more options and usage, check out the full tutorial and documentation on github.com – arielf Mar 30 '17 at 21:50
  • @arielf And what about testing the model? Is it okay to use "vw --binary --nn 4 train.vw -f category.model" and then "vw --binary -t -i category.model -p est.vw"? When I add the --nn argument to the second cmd it gives me: Error: option '--nn' cannot be specified more than once – Andrei Mar 30 '17 at 21:54
  • It's not actually true that VW cannot transform the features. There are many "transformations" available, e.g. `--quadratic` (and `--cubic` and in general `--interactions`) or `--dictionary`. If the features are words, you can use `--ngram`, `--skip`, `--affix` or `--spelling`. See `vw -h` for details. – Martin Popel Mar 31 '17 at 11:57
  • Martin, fair enough, all these options do some very simple preprocessing, like combining existing features, and adding/removing features on the fly. My point was that `vw` doesn't support arbitrary feature transforms like `tf-idf`. Even simple math function transforms (like say, `log`, or `sqrt`) aren't currently supported in the core itself. – arielf Mar 31 '17 at 23:14
  • @Andrei when you test (`-i modelfile -t ...`) the `--nn 4` is already read from the saved model, so just drop it from the test command line to make it work. – arielf Mar 31 '17 at 23:16
  • Let's assume we have the complete training and testing sets in advance. Does it make sense to train on both raw text and tf-idf features, put into 2 separate namespaces? Does it add anything? Can it lead to overfitting? – Mischa Lisovyi Nov 15 '18 at 12:05
  • @MykhailoLisovyi training on both complicates the model, with some (plausibly large) overlap/redundancy, so my intuition says it is better to avoid it. If you know `tf-idf`, the additional features would be weaker, noisier, and won't help much. Would it overfit? That depends on many factors. Most importantly, the ratio of the number of examples (data-set length) to the model-complexity (number of features, data-set width). If you have too few examples on too many features, this would generally lead to overfitting, but if you have a lot of examples, you're safer against over-fitting. – arielf Nov 16 '18 at 22:24