
I have been given a set of testing data which was classified by 3 people as either true or false. I was also given the confidence of each classification: for example, sometimes only 2 of the 3 agreed in one direction. How can I incorporate this into my classifier models? I have looked into SGDClassifier, which has a class_weight parameter, as does SVM. I am iterating over the confidence levels and, for each row of data, assigning a weight of 3 or 2 depending on whether all three classified it the same way or not:

x = 0
weights = {}
for d in confidence:
    val = int(d[1])
    if val == 1:
        weight = 3
    else:  # d = 0.66
        weight = 2
    x += 1
    weights[x] = weight

Unfortunately then, when running:

SGDClassifier(class_weight=weights)

I get the error:

Class label 2 not present.

What am I doing wrong?

maxisme
  • What is the format of `confidence`? Also, did you check if you are populating your dictionary right? How many keys are in the dictionary? `print weights.keys()`? That error usually happens when your `class_weight` dictionary does not have at least two weights (i.e. only one weight). – rayryeng May 23 '17 at 18:50
  • confidence is either `1.0` or `0.66` – maxisme May 23 '17 at 18:51
  • That didn't quite answer my question. Is `confidence` a list of numbers? Your code is currently not able to reproduce the errors you are experiencing primarily due to the lack of specifying what `confidence` is. – rayryeng May 23 '17 at 18:52
  • sorry my laptop just ran out of battery! `weights.keys()` = `[1, 2, 3, 4, 5, 6, 7, 8, 9,...]` – maxisme May 23 '17 at 19:08
  • and `len(weights)` is the same as `len(X)` i.e one for each row of data – maxisme May 23 '17 at 19:10
  • @rayreng yes confidence is a len(X) of floats of either `1.0` or `0.66` – maxisme May 23 '17 at 19:11
  • Sounds like what you have is a **sample weight**. This is not the same as `class_weight`. However, the `.fit` methods allows you to specify a `sample_weight`. – MB-F May 24 '17 at 13:39
  • @kazemakase yes. Unfortunately. What is the best way of handling sample weights? I am currently creating two sets of models one with the 2/3 probs and one with the 3/3 then doing a soft `VotingClassifier` where the 3/3 is weighted more than the 2/3. Is that a good idea? – maxisme May 24 '17 at 13:42
  • The [`.fit`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.fit) method allows you to specify a `sample_weight`. – MB-F May 24 '17 at 13:43
  • For grid search should I pass that as `param` into the `param_grid` or use the grid search `fit_params`? – maxisme May 24 '17 at 13:47
  • I think `fit_params` is the correct way. (Because the docs say **fit_params:** *Parameters to pass to the fit method.*) – MB-F May 24 '17 at 13:49
  • but I imagine that then won't output the fit_params in `gs.best_estimator_`? – maxisme May 24 '17 at 13:50
  • Why should it? The weights are a property of the data. They are the same for every grid run. – MB-F May 24 '17 at 13:51
  • Because grid search trains the best params for the `SVC` model for example. When then running the model on the testing data do I apply the param `sample_weight` or will that have been accounted for by the choice of params returned by the grid search? – maxisme May 24 '17 at 13:54
  • Please see the debate: https://stackoverflow.com/a/27682281/2768038 – maxisme May 24 '17 at 13:54
  • You need the sample weight only for training. It simply tells the training algorithm how much it should trust each data point. For testing you don't need it - the classifier does not care how much you trust a tested sample. – MB-F May 24 '17 at 13:57
  • valid valid valid. Sorry that was stupid. – maxisme May 24 '17 at 13:57

1 Answer


The confidence of a data point should be expressed as a sample_weight rather than a class_weight.

The .fit methods of some classifiers take a sample_weight argument.

There is an example in the scikit-learn documentation that shows how to do this with a Support Vector Classifier. Relevant excerpt:

# fit the model
clf_weights = svm.SVC()
clf_weights.fit(X, y, sample_weight=sample_weight_last_ten)
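
Applied to the asker's situation, the confidence values (1.0 when all three annotators agreed, 0.66 otherwise) can be passed to `fit` directly as the sample weights. A minimal sketch with made-up toy data (`X`, `y`, and `confidence` below are illustrative, not the asker's actual data):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data: 6 samples, 2 features, binary labels (illustrative values only).
X = np.array([[0.0, 1.0], [1.0, 0.5], [0.2, 0.8],
              [0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
y = np.array([0, 1, 0, 1, 0, 1])

# One confidence per row: 1.0 if all three annotators agreed, 0.66 otherwise.
confidence = np.array([1.0, 0.66, 1.0, 1.0, 0.66, 1.0])

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
# Each row contributes to the loss in proportion to its confidence.
clf.fit(X, y, sample_weight=confidence)
preds = clf.predict(X)
```

Only the ratio between the weights matters, so using the confidence values as-is (0.66 and 1.0) is equivalent to using 2 and 3.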
MB-F
  • would you be able to give an example of `sample_weight_last_ten` in my case of `2/3` or `3/3`? – maxisme May 24 '17 at 13:51
  • @Maximilian you did not provide much detail in your question but it looks like you can simply pass `sample_weight=confidence`. Note that the length of `confidence`, `y`, and the number of rows in `X` need to be the same. – MB-F May 24 '17 at 13:54
  • I just mean should I set the confidence of 2/3 as `2` and the confidence of 3/3 as `3` or `1` and `2` or `66` and `100` or `0.66` and `1`? – maxisme May 24 '17 at 13:56
  • @Maximilian I think only the ratio matters. Since you already have 0.66 and 1 use that. But I'm not much of an expert on this so play around a bit and see what works best ;) – MB-F May 24 '17 at 14:00
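
Following up on the grid-search question from the comment thread: in recent scikit-learn versions, keyword arguments passed to `GridSearchCV.fit` are forwarded to the underlying estimator's `fit` for each fold (older versions used the `fit_params` constructor argument instead). A hedged sketch, again with purely illustrative data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative toy data: 8 samples, 2 features, binary labels.
X = np.array([[0.0, 1.0], [1.0, 0.5], [0.2, 0.8], [0.9, 0.1],
              [0.4, 0.6], [0.7, 0.3], [0.1, 0.9], [0.8, 0.2]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
confidence = np.array([1.0, 0.66, 1.0, 1.0, 0.66, 1.0, 0.66, 1.0])

gs = GridSearchCV(
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
    param_grid={"alpha": [1e-4, 1e-3]},
    cv=2,  # small cv so the toy data has enough samples per fold
)
# sample_weight is routed to each fold's call to SGDClassifier.fit
gs.fit(X, y, sample_weight=confidence)
best = gs.best_params_
```

As noted in the comments, the weights are a property of the training data, not a hyperparameter: they are the same for every grid run, and they are not needed at prediction time.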