
I'm working on a machine learning problem in which I have a multi-label target where each label is a probability. In the past I've worked with multi-label problems, but each label was binary. For example, if the target was a series of book topics ('Python', 'Machine Learning', 'Fairy Tales', 'Cooking'), a machine learning book based on Python's scikit-learn would have a target of [1, 1, 0, 0].

Now, for the problem I'm trying to solve, my targets are not binary. Each target is a series of probabilities like [0.75, 0.25, 0, 0]. I think the target was produced in a crowd-sourcing fashion, and these probabilities reflect the variability of people's judgment. So, unless I want to bucket the probabilities into classes (i.e., p < 0.5 -> 0, p >= 0.5 -> 1), I'm stuck with a regression problem where the target needs to be constrained between 0 and 1. Any ideas about what type of algorithm I could try? I'm using Python's scikit-learn.

Thanks!

ADJ
  • If you only have a target with probabilities, I think you *are* stuck bucketing values like you proposed. Of course, a logistic regression does work with probabilities but I'm not sure if it is exactly suitable for what you are doing. – BlackVegetable Nov 06 '13 at 19:38
  • In my experience, if those probabilities are indeed judgment based, bucketing is the option, since the amount of noise in such data tends to be enormous. But they could also be ML-generated probabilities, from a NN or LDA or something similar. In the latter case it's up to you to decide on a cost function, since, say, `0.1` does not differ from `0.2` in the same way that `0.8` differs from `0.9`. – alko Nov 06 '13 at 21:51
  • @EMS: no, these probabilities are all over the place. – ADJ Nov 06 '13 at 22:36
  • What I know is that these are not experts, and thus a reliability system is in use. People who tend to get it right have more weight than those who don't. The final probability is a weighted average over people, where the weights are based on how reliable each person is. – ADJ Nov 06 '13 at 23:00

2 Answers


One option is to use a Multilayer Perceptron, since it does not require binary target values and can easily produce target values constrained to the range [0, 1] (e.g., when using a sigmoid activation function on the output layer). You can also normalize the outputs to ensure that the probabilities of the multiple classes sum to unity.
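
As a minimal sketch (not part of the original answer) of what this could look like with scikit-learn's own `MLPRegressor`, which was added in a later release (0.18+): its output layer is linear rather than sigmoid, so the clipping and row-normalization below stand in for the sigmoid output and normalization described above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # available in scikit-learn >= 0.18

# Toy data: X is (n_samples, n_features), Y is (n_samples, n_labels) with soft targets in [0, 1]
rng = np.random.RandomState(0)
X = rng.rand(100, 20)
Y = rng.dirichlet(np.ones(4), size=100)

mlp = MLPRegressor(hidden_layer_sizes=(50,), activation="logistic",
                   max_iter=2000, random_state=0)
mlp.fit(X, Y)  # multi-output regression: one output per label

raw = mlp.predict(X)
# MLPRegressor's output activation is the identity, so predictions are unbounded;
# clip to [0, 1] and renormalize each row so the per-label scores sum to one.
probs = np.clip(raw, 0.0, 1.0)
probs /= probs.sum(axis=1, keepdims=True)
```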

For additional info, there are numerous resources on the web (try searching on the terms "multilayer perceptron probability output"), but you might start here or here.

bogatron
  • Thanks, this is helpful. However, I'm constrained to Python's scikit-learn for the time being, and neural networks are not supported. – ADJ Nov 07 '13 at 21:31
  • sklearn does provide a [Perceptron](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron), although it appears to only support a single layer. – bogatron Nov 08 '13 at 00:31
  • The base perceptron (single unit) can hardly be considered a neural "network". However, there is an [ongoing pull request](https://github.com/scikit-learn/scikit-learn/pull/2120) to add an MLP implementation to scikit-learn. – ogrisel Nov 08 '13 at 15:44

Can you treat those crowd-sourced probabilities as label weights? If so, you might consider training algorithms that can take label weights into account, e.g., a linear classifier or a boosting algorithm.
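
Here is a sketch of one way to approximate label weights with stock scikit-learn (my illustration, not something the answer prescribes): recast each soft label as a weighted positive plus a weighted negative example, then train one weighted binary classifier per label. `LogisticRegression` is just one linear classifier whose `fit` method accepts `sample_weight`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: X is (n_samples, n_features); P is (n_samples, n_labels),
# where P[i, j] is the crowd-sourced probability that label j applies to sample i.
rng = np.random.RandomState(0)
X = rng.rand(100, 20)
P = rng.rand(100, 4)

models = []
for j in range(P.shape[1]):
    p = P[:, j]
    # Duplicate every sample once as a positive and once as a negative example,
    # weighting each copy by the probability mass it carries for this label.
    X_dup = np.vstack([X, X])
    y_dup = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    w_dup = np.concatenate([p, 1.0 - p])
    models.append(LogisticRegression().fit(X_dup, y_dup, sample_weight=w_dup))

# Soft prediction for label j on new data X_new: models[j].predict_proba(X_new)[:, 1]
```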

To make this concrete with a naive Bayes classifier: ordinarily we would treat each label as having weight 1, whereas here each label carries a fractional weight. If this were a document-classification application, we might have ground-truth labels for two training instances like below:

1. {News: 0.8, Sports: 0.5}
2. {News: 0.1, Sports: 0.8}

Suppose you have a word w1 which appears 5 times in the first instance, and 2 times in the second instance.

When you calculate the probability for word w1 given a class label, you perform:

P(w1 | News)   = (5*0.8 + 2*0.1) / (# of weighted occurrences of all words in all your News docs)
P(w1 | Sports) = (5*0.5 + 2*0.8) / (# of weighted occurrences of all words in all your Sports docs)

Notice how the label weights are taken into account when we learn the model. Essentially, each word occurrence gets credit discounted by its label weight.
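
Here is a toy sketch of that weighted bookkeeping in plain Python (the data and names are illustrative, taken from the example above, and smoothing is omitted):

```python
from collections import defaultdict

# Two training instances: word counts plus soft (crowd-sourced) labels.
docs = [
    ({"w1": 5}, {"News": 0.8, "Sports": 0.5}),
    ({"w1": 2}, {"News": 0.1, "Sports": 0.8}),
]

weighted_counts = defaultdict(lambda: defaultdict(float))  # class -> word -> weighted count
class_totals = defaultdict(float)                          # class -> total weighted count

for counts, labels in docs:
    for word, n in counts.items():
        for label, weight in labels.items():
            weighted_counts[label][word] += n * weight
            class_totals[label] += n * weight

# P(w1 | News) = (5*0.8 + 2*0.1) / total weighted News count, as in the formulas above.
p_w1_news = weighted_counts["News"]["w1"] / class_totals["News"]
p_w1_sports = weighted_counts["Sports"]["w1"] / class_totals["Sports"]
print(p_w1_news, p_w1_sports)  # both 1.0 here, since w1 is the only word in this toy corpus
```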

greeness
  • I could, and actually thought about it. I just haven't found a way to do it in Python's sklearn yet... – ADJ Nov 06 '13 at 22:39
  • Then it's a different question. Maybe modify the source code of sklearn yourself depending on which algorithm you choose? – greeness Nov 06 '13 at 22:45
  • Would having weighted observations achieve the same end result? – ADJ Nov 07 '13 at 21:30
  • You mean the feature values are weighted? Sure, for linear classifiers. In the same document-classification example, you can replace the raw # of occurrences with a tf-idf score, which is an example of a weighted feature. – greeness Nov 07 '13 at 22:50
  • If you mean you want to pass the weight from the labels to the features, it depends on the learning algorithm. At least for naive Bayes it is probably equivalent, because the weight is used by multiplying the feature value by the label weight, so you can treat the label weight as a "feature weight". – greeness Nov 07 '13 at 23:04