One approach is to observe how the entropy of the class label distribution changes after you partition the class values according to attribute values for a given attribute. The attribute that gives the largest entropy reduction is the "best" one. (This works for discrete attributes only; you'll have to discretize the attributes in order to use this method: e.g. convert hoursSlept > 7 to sleptAlot; convert 5 <= hoursSlept <= 7 to sleptEnough; and hoursSlept < 5 to sleepDeprived.)
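For instance, a minimal sketch of that discretization in Python (the attribute hoursSlept, the bin labels, and the thresholds are just the ones from the example above):

```python
def discretize_hours_slept(hours):
    """Map the continuous attribute hoursSlept to one of three discrete values."""
    if hours > 7:
        return "sleptAlot"
    elif hours >= 5:       # 5 <= hours <= 7
        return "sleptEnough"
    else:                  # hours < 5
        return "sleepDeprived"

print([discretize_hours_slept(h) for h in [8.0, 6.5, 4.0]])
# -> ['sleptAlot', 'sleptEnough', 'sleepDeprived']
```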
The entropy $H$ of a discrete distribution $(p_1, p_2, \dots, p_k)$ is defined as
$$H = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_k \log_2 p_k$$
and it measures, roughly speaking, the impurity of the distribution. The less you can tell about the outcome a priori, the higher the entropy; the more you can tell about the outcome a priori, the lower the entropy. In fact, the distribution $p_i = 1/k$ for all $i$ (where all outcomes are equally likely) has the highest possible entropy (value $\log_2 k$), and distributions where $p_i = 1$ for some $i$ have the lowest possible entropy (value $0$).
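As a quick sanity check, here is a small Python sketch of the entropy formula (the function name entropy is just illustrative):

```python
import math

def entropy(probs):
    """H = -sum_i p_i * log2(p_i) for a discrete distribution, with 0*log2(0) := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over k=4 values -> log2(4) = 2.0
print(entropy([1.0, 0.0, 0.0, 0.0]))      # one certain outcome -> 0.0
```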
Define $p_i = n_i/n$, where $n$ is the number of examples and $n_i$ is the number of examples with the $i$-th class value. This induces a discrete distribution $(p_1, p_2, \dots, p_k)$, where $k$ is the number of class values.
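A possible sketch of computing this induced class distribution (and its entropy) from a list of class labels, reusing the entropy function above; the label values are made up for illustration:

```python
from collections import Counter

def class_distribution(labels):
    """Return (p_1, ..., p_k) with p_i = n_i / n over the distinct class values."""
    n = len(labels)
    counts = Counter(labels)           # n_i for each class value
    return [n_i / n for n_i in counts.values()]

labels = ["rested", "rested", "tired", "rested", "tired"]
print(class_distribution(labels))             # [0.6, 0.4]
print(entropy(class_distribution(labels)))    # ~0.971
```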
For an attribute $A$ with possible values $a_1, a_2, \dots, a_r$, define $S_i$ to be the set of those examples whose value of the attribute $A$ equals $a_i$. Each of the sets $S_i$ induces a discrete distribution (defined in the same way as before). Let $|S_i|$ be the number of examples in the set $S_i$, and denote the corresponding entropy by $H(S_i)$.
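One way to sketch this partitioning step, assuming each example is a dict of attribute values with its class label stored under a hypothetical "label" key:

```python
from collections import defaultdict

def partition_by_attribute(examples, attribute):
    """Group the examples into the sets S_i, one per value a_i of the attribute."""
    parts = defaultdict(list)
    for ex in examples:
        parts[ex[attribute]].append(ex)
    return parts   # {a_i: S_i}

def subset_entropies(parts):
    """Per-subset entropies H(S_i) of the class labels within each S_i."""
    return {a_i: entropy(class_distribution([ex["label"] for ex in S_i]))
            for a_i, S_i in parts.items()}
```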
Now compute
$$\text{Gain}(A) = H - \frac{|S_1|}{n} H(S_1) - \dots - \frac{|S_r|}{n} H(S_r)$$
and pick the attribute that maximizes $\text{Gain}(A)$. The intuition is that the attribute that maximizes this difference partitions the examples so that in most $S_i$'s the examples have similar labels (i.e. entropy is low). Intuitively, the value of $\text{Gain}(A)$ tells you how informative the attribute $A$ is about the class labels.
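Putting the pieces together, a sketch of $\text{Gain}(A)$ and of picking the best attribute could look like this; it leans on the hypothetical helpers above, and the tiny dataset (the coffee attribute and the rested/tired labels) is invented purely for illustration:

```python
def information_gain(examples, attribute):
    """Gain(A) = H - sum_i |S_i|/n * H(S_i)."""
    n = len(examples)
    h_total = entropy(class_distribution([ex["label"] for ex in examples]))
    parts = partition_by_attribute(examples, attribute)
    weighted = sum(len(S_i) / n *
                   entropy(class_distribution([ex["label"] for ex in S_i]))
                   for S_i in parts.values())
    return h_total - weighted

def best_attribute(examples, attributes):
    """Pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a))

examples = [
    {"hoursSlept": "sleptAlot",     "coffee": "yes", "label": "rested"},
    {"hoursSlept": "sleptEnough",   "coffee": "no",  "label": "rested"},
    {"hoursSlept": "sleepDeprived", "coffee": "yes", "label": "tired"},
    {"hoursSlept": "sleepDeprived", "coffee": "no",  "label": "tired"},
]
print(best_attribute(examples, ["hoursSlept", "coffee"]))  # -> "hoursSlept"
```

Here hoursSlept yields pure subsets (Gain = 1 bit), while coffee leaves both subsets evenly mixed (Gain = 0), so hoursSlept is selected.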
For your reference, this is widely used in decision-tree learning, and the measure is referred to as information gain. See, for example, these slides; this explanation on Math.SE is really great (although it's in the context of decision-tree learning).