I am trying to train a LambdaMART model to perform a pairwise sort of a list of objects. My training dataset consists of 50,000 feature vectors, each 112-dimensional. Each feature is coded as a non-negative integer.

The target value is a positive integer (not consecutive). Given two new instances, X and Y, I want my model to be able to predict whether the target value for X is greater than that for Y.

Since this is not an information retrieval application, the concept of a query is irrelevant. All 50,000 instances belong to the same "query".

When I run my model, even with a setting that requests a 70%/30% train-validation split, I get 0 deviance on my validation set, and the gbm.perf function throws an exception if I try to use the OOB method to find the optimal number of trees.

Overall, I'm pretty confused as to what this package is doing with all these unhelpfully named parameters. All I want to do is specify a train-validation split and then minimize the validation error over the range of tree counts. That shouldn't be too much to ask, but this package makes it so difficult to know which knobs I need to set that I'm about to implement it myself, just so I have some transparency and know what it's doing.

Sorry for the rant, but I could use some help getting this package to return meaningful validation results.

  • This question appears to be off-topic because it is about statistics and not really a specific programming question. Perhaps it's better to ask this on [Cross Validated](http://stats.stackexchange.com) – Jaap Aug 23 '15 at 05:42
  • @Jaap question has nothing to do with statistics, OP asks about usage of a particular implementation in a specific programming language – lejlot Aug 23 '15 at 05:52
  • @lejlot OP is talking about how to get the parameters of a model right. Besides that it appears to be more about statistics, the question is also lacking a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and as it's currently stated also looks more like a request for a tutorial. – Jaap Aug 23 '15 at 06:11
  • I am not saying it is a good question, simply stating that it is not about statistics. He clearly knows what to do; the only problem is the R API – lejlot Aug 23 '15 at 06:27
  • @Jaap: lejlot is correct, I am having issues sorting out the API. I know what I want to do and I could, if I felt masochistic, program this algorithm myself, but I want to use this package for expediency. From past experience, the Cross Validated folks will bump a post like this to Stack Overflow. –  Aug 23 '15 at 17:47
  • @Jaap however, your point is well taken on an example. I will post the code and description that is giving me trouble. –  Aug 23 '15 at 17:47
  • @Bey ok, i will retract my close vote – Jaap Aug 23 '15 at 17:50

1 Answer

I don't think LambdaMART is ideal for your use case. The algorithm assumes that the data consists of groups, each containing multiple items, and the objective is a function of the overall ordering of the items within a group. Therefore, when the data is split into training and validation sets, all items belonging to the same group are assigned together to one set or the other. In GBM, what constitutes a group is specified with the group parameter, which is a list of column names; all instances that agree on these columns belong to the same group.
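For illustration, a grouped pairwise fit might look roughly like the sketch below. The toy data frame `toy`, the column name `query_id`, and every hyperparameter value are assumptions made up for this example, not something from the question. The call follows the form of the pairwise demo shipped with gbm, and the fit is skipped if the package is not installed.

```r
set.seed(1)

# Hypothetical toy data: 40 groups of 10 items each, two count features,
# and an integer relevance target. None of this comes from the question.
toy <- data.frame(
  query_id = rep(1:40, each = 10),
  x1 = rpois(400, lambda = 3),
  x2 = rpois(400, lambda = 5)
)
toy$y <- toy$x1 + rpois(400, lambda = 1)

if (requireNamespace("gbm", quietly = TRUE)) {
  fit <- gbm::gbm(
    y ~ x1 + x2,
    data = toy,
    distribution = list(
      name = "pairwise",
      group = "query_id",   # column(s) that define a group
      metric = "ndcg"
    ),
    n.trees = 200,
    shrinkage = 0.05,
    interaction.depth = 2,
    n.minobsinnode = 5,
    bag.fraction = 0.5,
    train.fraction = 0.7,   # whole groups go to one side of the split
    keep.data = TRUE,
    verbose = FALSE
  )
  # Number of trees that does best on the held-out groups.
  best_iter <- gbm::gbm.perf(fit, method = "test", plot.it = FALSE)
}
```

With many groups, the held-out portion is non-empty and the validation metric is meaningful. With a single group there is nothing left to hold out.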

In your scenario, you have a single large "group" that consists of all items in the training set; therefore, it cannot be split into train and validation sets. I see two options for you:

  1. Train a least squares model directly on the target value ("gaussian" distribution argument).
  2. (Better but more complex) If the intended use of your model after training is to predict, given a pair of items, whether the first one should be preferred to the second, you could reformat your data accordingly: 112 feature columns for the first item, plus 112 feature columns for the second item, plus a binary target that is 1 iff item 1 has a higher target in the original data set. Then, train a logistic model on that (distribution "bernoulli").
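Both options can be sketched in R as follows. This is a minimal illustration on made-up data, not a tuned setup: the data frame `d`, the helper `make_pairs`, the `_a`/`_b` column suffixes, the pair counts, and all hyperparameter values are assumptions introduced here. Items are split before pairs are built so that no item appears on both sides, and the pairwise model is scored on held-out pairs over a grid of tree counts.

```r
set.seed(42)

# Hypothetical stand-in for the real data: 600 items with 5 count features
# (the question has 50,000 items and 112 features) and a positive integer target.
n <- 600
p <- 5
d <- as.data.frame(matrix(rpois(n * p, lambda = 3), nrow = n))
names(d) <- paste0("f", seq_len(p))
d$target <- 1L + d$f1 + 2L * d$f2 + rpois(n, lambda = 1)

# Split ITEMS (not pairs) 70/30 so the validation pairs use unseen items only.
train_idx <- sample(n, size = round(0.7 * n))
train_items <- d[train_idx, ]
valid_items <- d[-train_idx, ]

# Build randomly sampled pairs: features of item A, features of item B,
# and label = 1 iff A's target is strictly higher than B's.
make_pairs <- function(items, n_pairs, target_col = "target") {
  feats <- setdiff(names(items), target_col)
  i <- sample(nrow(items), n_pairs, replace = TRUE)
  j <- sample(nrow(items), n_pairs, replace = TRUE)
  a <- items[i, feats, drop = FALSE]
  b <- items[j, feats, drop = FALSE]
  names(a) <- paste0(feats, "_a")
  names(b) <- paste0(feats, "_b")
  out <- cbind(a, b)
  rownames(out) <- NULL
  out$label <- as.integer(items[[target_col]][i] > items[[target_col]][j])
  out
}

train_pairs <- make_pairs(train_items, 4000)
valid_pairs <- make_pairs(valid_items, 1500)

if (requireNamespace("gbm", quietly = TRUE)) {
  # Option 1: least squares on the raw target. Rows of d are already in
  # random order here; gbm trains on the first 70% and holds out the rest.
  fit_ls <- gbm::gbm(target ~ ., data = d, distribution = "gaussian",
                     n.trees = 300, shrinkage = 0.05, interaction.depth = 3,
                     train.fraction = 0.7, verbose = FALSE)
  best_ls <- gbm::gbm.perf(fit_ls, method = "test", plot.it = FALSE)

  # Option 2: logistic model on the pair data, scored on held-out pairs.
  fit_pair <- gbm::gbm(label ~ ., data = train_pairs,
                       distribution = "bernoulli",
                       n.trees = 300, shrinkage = 0.05, interaction.depth = 3,
                       verbose = FALSE)
  tree_grid <- seq(25, 300, by = 25)
  # One column of predicted probabilities per entry in tree_grid.
  prob <- predict(fit_pair, newdata = valid_pairs,
                  n.trees = tree_grid, type = "response")
  valid_acc <- colMeans((prob > 0.5) == (valid_pairs$label == 1))
  best_pair <- tree_grid[which.max(valid_acc)]
}
```

Sampling pairs at random, rather than enumerating all of them, keeps the pair table manageable; how many pairs are enough for the real 50,000 items is something to check empirically.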
stefan.schroedl