
I'm trying to predict a variable y from a set of features X, where X initially contains 36 features. I have two questions about this:

  1. How should boolean attributes (0/1) be handled when creating polynomial features? Squaring them, for example, doesn't make sense.

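Code I have so far:

from sklearn.preprocessing import PolynomialFeatures
# Expand X_train with all degree-2 terms (squares and pairwise products)
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X_train)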
  2. How can I do feature selection for polynomial regression? Creating polynomial features of degree 2 for 36 variables increases the size of X drastically. Is there a method that runs a selection and returns the best model based on, for example, MSE?
Alanovic

1 Answer

  1. True, there is no point in squaring boolean features: for 0/1 values x² = x, so the square just duplicates the original column. One solution is to use PolynomialFeatures with the option interaction_only=True, so you only get products of distinct features. For booleans, the product is a logical AND. You may also write your own function to build other combinations such as OR or XOR.

  2. Depending on the number of original features, an exhaustive search over all possible feature combinations may or may not be too time-consuming. With 36 features (about 700 columns after the degree-2 expansion), I'd guess it is. In that case you could:

a) use LASSO regression (or elastic net), which performs variable selection automatically by shrinking some coefficients to exactly zero

b) try tree-based methods for the same reason (e.g. random forest)

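c) try some feature selection methods (e.g. chi-square; note that the chi-square test is meant for categorical targets, so for a continuous y a univariate F-test or mutual information is the usual counterpart)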

Stergios
  • Thanks for your reply! But how should I handle a feature set that contains both boolean and numeric features? For example, two features x1, x2 where x1 is boolean. How do I generate a function like y = x0 + w1*x1 + w2*x2 + w3*x1*x2 + w4*x2²? That is, ignore the boolean for higher degrees and only generate interactions for it, but generate higher-degree polynomial terms for x2? – Alanovic Mar 10 '16 at 01:22
  • You didn't mention you had both boolean and numeric features in your initial post. Anyway, if you don't want to write your own function to do it, you could just use PolynomialFeatures with interaction_only=False and then delete any duplicate features (these would be the squares of the boolean features, which equal the original boolean columns). Check here: http://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns – Stergios Mar 11 '16 at 08:34