One approach is to observe how the entropy of the class label distribution changes after you partition the class values according to attribute values for a given attribute. The attribute that gives the largest entropy reduction is the "best" one. (This works for discrete attributes only; you'll have to discretize the attributes in order to use this method: e.g. convert hoursSlept > 7 to sleptAlot; convert 5 <= hoursSlept <= 7 to sleptEnough; and hoursSlept < 5 to sleepDeprived.)
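For instance, a minimal sketch of that discretization in Python (the attribute hoursSlept, the bin labels, and the thresholds are just the ones from the example above):

```python
def discretize_hours_slept(hours):
    """Map the continuous attribute hoursSlept to one of three discrete values."""
    if hours > 7:
        return "sleptAlot"
    elif hours >= 5:       # 5 <= hours <= 7
        return "sleptEnough"
    else:                  # hours < 5
        return "sleepDeprived"

print([discretize_hours_slept(h) for h in [8.0, 6.5, 4.0]])
# -> ['sleptAlot', 'sleptEnough', 'sleepDeprived']
```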
The entropy $H$ of a discrete distribution $(p_1, p_2, \dots, p_k)$ is defined as
$$H = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_k \log_2 p_k$$
and it measures, roughly speaking, the impurity of the distribution. The less you can tell about the outcome a priori, the higher the entropy; the more you can tell about the outcome a priori, the lower the entropy. In fact, the distribution $p_i = 1/k$ for all $i$ (where all outcomes are equally likely) has the highest possible entropy (value $\log_2 k$), and distributions where $p_i = 1$ for some $i$ have the lowest possible entropy (value $0$).
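As a quick sanity check, here is a small Python sketch of the entropy formula (the function name entropy is just illustrative):

```python
import math

def entropy(probs):
    """H = -sum_i p_i * log2(p_i) for a discrete distribution, with 0*log2(0) := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over k=4 values -> log2(4) = 2.0
print(entropy([1.0, 0.0, 0.0, 0.0]))      # one certain outcome -> 0.0
```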
Define $p_i = n_i/n$, where $n$ is the number of examples and $n_i$ is the number of examples with the $i$-th class value. This induces a discrete distribution $(p_1, p_2, \dots, p_k)$, where $k$ is the number of class values.
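A possible sketch of computing this induced class distribution (and its entropy) from a list of class labels, reusing the entropy function above; the label values are made up for illustration:

```python
from collections import Counter

def class_distribution(labels):
    """Return (p_1, ..., p_k) with p_i = n_i / n over the distinct class values."""
    n = len(labels)
    counts = Counter(labels)           # n_i for each class value
    return [n_i / n for n_i in counts.values()]

labels = ["rested", "rested", "tired", "rested", "tired"]
print(class_distribution(labels))             # [0.6, 0.4]
print(entropy(class_distribution(labels)))    # ~0.971
```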
For an attribute $A$ with possible values $a_1, a_2, \dots, a_r$, define $S_i$ to be the set of those examples whose value of the attribute $A$ equals $a_i$. Each of the sets $S_i$ induces a discrete distribution (defined in the same way as before). Let $|S_i|$ be the number of examples in the set $S_i$, and denote the corresponding entropy by $H(S_i)$.
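One way to sketch this partitioning step, assuming each example is a dict of attribute values with its class label stored under a hypothetical "label" key:

```python
from collections import defaultdict

def partition_by_attribute(examples, attribute):
    """Group the examples into the sets S_i, one per value a_i of the attribute."""
    parts = defaultdict(list)
    for ex in examples:
        parts[ex[attribute]].append(ex)
    return parts   # {a_i: S_i}

def subset_entropies(parts):
    """Per-subset entropies H(S_i) of the class labels within each S_i."""
    return {a_i: entropy(class_distribution([ex["label"] for ex in S_i]))
            for a_i, S_i in parts.items()}
```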
Now compute
$$\text{Gain}(A) = H - \frac{|S_1|}{n} H(S_1) - \dots - \frac{|S_r|}{n} H(S_r)$$
and pick the attribute that maximizes $\text{Gain}(A)$. The intuition is that the attribute that maximizes this difference partitions the examples so that in most $S_i$'s the examples have similar labels (i.e. entropy is low). Intuitively, the value of $\text{Gain}(A)$ tells you how informative the attribute $A$ is about the class labels.
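Putting the pieces together, a sketch of $\text{Gain}(A)$ and of picking the best attribute could look like this; it leans on the hypothetical helpers above, and the tiny dataset (the coffee attribute and the rested/tired labels) is invented purely for illustration:

```python
def information_gain(examples, attribute):
    """Gain(A) = H - sum_i |S_i|/n * H(S_i)."""
    n = len(examples)
    h_total = entropy(class_distribution([ex["label"] for ex in examples]))
    parts = partition_by_attribute(examples, attribute)
    weighted = sum(len(S_i) / n *
                   entropy(class_distribution([ex["label"] for ex in S_i]))
                   for S_i in parts.values())
    return h_total - weighted

def best_attribute(examples, attributes):
    """Pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a))

examples = [
    {"hoursSlept": "sleptAlot",     "coffee": "yes", "label": "rested"},
    {"hoursSlept": "sleptEnough",   "coffee": "no",  "label": "rested"},
    {"hoursSlept": "sleepDeprived", "coffee": "yes", "label": "tired"},
    {"hoursSlept": "sleepDeprived", "coffee": "no",  "label": "tired"},
]
print(best_attribute(examples, ["hoursSlept", "coffee"]))  # -> "hoursSlept"
```

Here hoursSlept yields pure subsets (Gain = 1 bit), while coffee leaves both subsets evenly mixed (Gain = 0), so hoursSlept is selected.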
For your reference, this is widely used in decision-tree learning, and the measure is referred to as information gain. See, for example, these slides; this explanation on Math.SE is really great (although it's in the context of decision-tree learning).