
With R, one can ignore a variable (column) while building a model with the following syntax:

model = lm(dependent.variable ~ . - ignored.variable, data = my.training.set)

It's very handy when your data set contains indexes or IDs.

How would you do that with scikit-learn in Python, assuming your data is in pandas data frames?

Mathieu
  • The data you pass will be NumPy arrays, or perhaps pandas objects (which decompose to NumPy arrays); you can index and select the columns of interest and pass those to the fitting function. See the [numpy docs](http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html) and [pandas docs](http://pandas.pydata.org/pandas-docs/stable/indexing.html). – EdChum May 01 '14 at 10:17
  • Well, I have more than 300 columns. If I have to proceed manually, I'd rather ignore 1 column than add 299. – Mathieu May 01 '14 at 10:27
  • 3
    Well in pandas you can do `list(df.columns) - ignore_col` which would not be too verbose – EdChum May 01 '14 at 10:37
  • How do I feed the list of retained columns into scikit-learn? – Mathieu May 01 '14 at 11:03
  • 2
    You typically call `fit`, passing the data and labels as the X and y params: e.g. `clf = LogisticRegression()` then `clf.fit(X, y)`. If you were using pandas, the last line would become `clf.fit(df[list(df.columns) - ignoreCol].values, df.target.values)`. This is a pseudo-example, but it should demonstrate what I mean; there is an example here using pandas with sklearn: http://python.dzone.com/articles/python-making-scikit-learn-and – EdChum May 01 '14 at 11:10
  • If you need to ignore columns based on regex: https://stackoverflow.com/a/62220733/317797 – BSalita Jun 16 '20 at 08:29
  • Would I be wrong to suggest this issue begs for a new feature to sklearn models? I too have found the need for a ignorex parameter when instancing models. – BSalita Jun 16 '20 at 11:38
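To make the comments above concrete, here is a minimal sketch of the list-based approach. The frame, the column names, and the `ignored` list are all hypothetical; the point is that a plain list comprehension reproduces R's `~ . - ignored.variable` idiom:

```python
import pandas as pd

# hypothetical frame: 'Id' plays the role of the column to ignore,
# 'y' is the target we want to predict
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "x1": [0.1, 0.2, 0.3],
    "x2": [1.0, 2.0, 3.0],
    "y":  [0, 1, 0],
})

# keep every column except the identifier and the target
ignored = ["Id", "y"]
features = [c for c in df.columns if c not in ignored]

# X is what you would then hand to an estimator, e.g. clf.fit(X, df["y"])
X = df[features]
```

This works for any number of columns, so ignoring 1 column out of 300 is no more verbose than ignoring 299.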

1 Answer


So this is from my own code I used to do some prediction on StackOverflow last year:

from __future__ import division
from pandas import *
from sklearn import cross_validation
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier

basic_feature_names = [ 'BodyLength'
                      , 'NumTags'
                      , 'OwnerUndeletedAnswerCountAtPostTime'
                      , 'ReputationAtPostCreation'
                      , 'TitleLength'
                      , 'UserAge' ]

fea = ...  # extract the features - removed for brevity
# construct our classifier
clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)
# now fit
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
# load the test data
priv_fea = ...  # this was my test dataset
# now calculate the predicted classes
pred = clf.predict(priv_fea[basic_feature_names])

So if I wanted to train on a subset of the features for classification, I could have done this:

# want to train using fewer features so remove 'BodyLength'
basic_feature_names.remove('BodyLength')

clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

So the idea here is that a list can be used to select a subset of the columns in the pandas DataFrame; we can construct a new list, or remove a value from an existing one, and use that list for selection.
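Pandas also offers shortcuts that avoid building the list by hand; a small sketch (the frame and column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new frame without the named column(s);
# the original df is left untouched
X = df.drop(columns=["b"])

# columns.difference returns a (sorted) Index of the remaining
# column labels, which can itself be used for selection
keep = df.columns.difference(["b"])
X2 = df[keep]
```

Note that `drop(columns=...)` is the newer keyword form; older pandas versions use the equivalent `df.drop("b", axis=1)`.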

I'm not sure how you could do this as easily using raw NumPy arrays, as their indexing works differently.
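For completeness, with a raw NumPy array you would select columns by position rather than by name; a minimal sketch (the array and the dropped column index are arbitrary):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# keep every column except index 1, via integer (fancy) indexing
keep = [c for c in range(arr.shape[1]) if c != 1]
sub = arr[:, keep]

# np.delete does the same thing in one call
sub2 = np.delete(arr, 1, axis=1)
```

Both approaches return a new array, so the original is unchanged; the pandas route is usually more convenient because you can refer to columns by name.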

EdChum