
I have been given a set of testing data which was classified by 3 people as either true or false. I was also given the confidence of each classification: for example, sometimes only 2 of the 3 agreed in one direction. How can I incorporate this into my classifier models? I have looked into SGDClassifier, which has a class_weight parameter, as does SVM. I am iterating over the confidence levels and, for each row of data, assigning a weight of 3 or 2 depending on whether all three classified it the same way or not:

x = 0
weights = {}
for d in confidence:
    val = int(d[1])
    if val == 1:
        weight = 3
    else:  # d = 0.66
        weight = 2
    x += 1
    weights[x] = weight

Unfortunately then, when running:

SGDClassifier(class_weight=weights)

I get the error:

Class label 2 not present.

What am I doing wrong?

maxisme
  • What is the format of `confidence`? Also, did you check if you are populating your dictionary right? How many keys are in the dictionary? `print weights.keys()`? That error usually happens when your `class_weight` dictionary does not have at least two weights (i.e. only one weight). – rayryeng May 23 '17 at 18:50
  • confidence is either `1.0` or `0.66` – maxisme May 23 '17 at 18:51
  • That didn't quite answer my question. Is `confidence` a list of numbers? Your code is currently not able to reproduce the errors you are experiencing primarily due to the lack of specifying what `confidence` is. – rayryeng May 23 '17 at 18:52
  • sorry my laptop just ran out of battery! `weights.keys()` = `[1, 2, 3, 4, 5, 6, 7, 8, 9,...]` – maxisme May 23 '17 at 19:08
  • and `len(weights)` is the same as `len(X)` i.e one for each row of data – maxisme May 23 '17 at 19:10
  • @rayreng yes confidence is a len(X) of floats of either `1.0` or `0.66` – maxisme May 23 '17 at 19:11
  • Sounds like what you have is a **sample weight**. This is not the same as `class_weight`. However, the `.fit` methods allows you to specify a `sample_weight`. – MB-F May 24 '17 at 13:39
  • @kazemakase yes. Unfortunately. What is the best way of handling sample weights? I am currently creating two sets of models one with the 2/3 probs and one with the 3/3 then doing a soft `VotingClassifier` where the 3/3 is weighted more than the 2/3. Is that a good idea? – maxisme May 24 '17 at 13:42
  • The [`.fit`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.fit) method allows you to specify a `sample_weight`. – MB-F May 24 '17 at 13:43
  • For grid search should I pass that as `param` into the `param_grid` or use the grid search `fit_params`? – maxisme May 24 '17 at 13:47
  • I think `fit_params` is the correct way. (Because the docs say **fit_params:** *Parameters to pass to the fit method.*) – MB-F May 24 '17 at 13:49
  • but I imagine that then won't output the fit_params in `gs.best_estimator_`? – maxisme May 24 '17 at 13:50
  • Why should it? The weights are a property of the data. They are the same for every grid run. – MB-F May 24 '17 at 13:51
  • Because grid search trains the best params for the `SVC` model for example. When then running the model on the testing data do I apply the param `sample_weight` or will that have been accounted for by the choice of params returned by the grid search? – maxisme May 24 '17 at 13:54
  • Please see the debate: https://stackoverflow.com/a/27682281/2768038 – maxisme May 24 '17 at 13:54
  • You need the sample weight only for training. It simply tells the training algorithm how much it should trust each data point. For testing you don't need it - the classifier does not care how much you trust a tested sample. – MB-F May 24 '17 at 13:57
  • valid valid valid. Sorry that was stupid. – maxisme May 24 '17 at 13:57

1 Answer


The confidence of a data point should be expressed as a sample_weight rather than a class_weight.

The .fit methods of some classifiers take a sample_weight argument.

There is an example in the scikit-learn documentation that shows how to do this with a Support Vector Classifier. Relevant excerpt:

# fit the model
clf_weights = svm.SVC()
clf_weights.fit(X, y, sample_weight=sample_weight_last_ten)
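
Applied to the asker's situation, the confidence values (1.0 when all three annotators agreed, 0.66 otherwise) can be passed to `fit` directly as the sample weights. A minimal sketch with made-up toy data (`X`, `y`, and `confidence` below are illustrative, not the asker's actual data):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data: 6 samples, 2 features, binary labels (illustrative values only).
X = np.array([[0.0, 1.0], [1.0, 0.5], [0.2, 0.8],
              [0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
y = np.array([0, 1, 0, 1, 0, 1])

# One confidence per row: 1.0 if all three annotators agreed, 0.66 otherwise.
confidence = np.array([1.0, 0.66, 1.0, 1.0, 0.66, 1.0])

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
# Each row contributes to the loss in proportion to its confidence.
clf.fit(X, y, sample_weight=confidence)
preds = clf.predict(X)
```

Only the ratio between the weights matters, so using the confidence values as-is (0.66 and 1.0) is equivalent to using 2 and 3.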
MB-F
  • would you be able to give an example of `sample_weight_last_ten` in my case of `2/3` or `3/3`? – maxisme May 24 '17 at 13:51
  • @Maximilian you did not provide much detail in your question but it looks like you can simply pass `sample_weight=confidence`. Note that the length of `confidence`, `y`, and the number of rows in `X` need to be the same. – MB-F May 24 '17 at 13:54
  • I just mean should I set the confidence of 2/3 as `2` and the confidence of 3/3 as `3` or `1` and `2` or `66` and `100` or `0.66` and `1`? – maxisme May 24 '17 at 13:56
  • @Maximilian I think only the ratio matters. Since you already have 0.66 and 1 use that. But I'm not much of an expert on this so play around a bit and see what works best ;) – MB-F May 24 '17 at 14:00
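
Following up on the grid-search question from the comment thread: in recent scikit-learn versions, keyword arguments passed to `GridSearchCV.fit` are forwarded to the underlying estimator's `fit` for each fold (older versions used the `fit_params` constructor argument instead). A hedged sketch, again with purely illustrative data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative toy data: 8 samples, 2 features, binary labels.
X = np.array([[0.0, 1.0], [1.0, 0.5], [0.2, 0.8], [0.9, 0.1],
              [0.4, 0.6], [0.7, 0.3], [0.1, 0.9], [0.8, 0.2]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
confidence = np.array([1.0, 0.66, 1.0, 1.0, 0.66, 1.0, 0.66, 1.0])

gs = GridSearchCV(
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
    param_grid={"alpha": [1e-4, 1e-3]},
    cv=2,  # small cv so the toy data has enough samples per fold
)
# sample_weight is routed to each fold's call to SGDClassifier.fit
gs.fit(X, y, sample_weight=confidence)
best = gs.best_params_
```

As noted in the comments, the weights are a property of the training data, not a hyperparameter: they are the same for every grid run, and they are not needed at prediction time.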