2

So far I am able to get the list of all column names present in the dataframe or to get a specific column names based on its datatype, starting letters, etc...

Now my requirement is to get the whole list of column names or a sublist and to exclude one column from it (i.e Target variable / Label Column. This is a part of Machine Learning. So I am using the terms that are used in machine learning)

Please note I am not speaking about the data present in those columns. I am just taking the column names and want to exclude a particular column by its name

Please see below example for better understanding :

# Get all the column names from a Dataframe
df.columns

Index(['transactionID', 'accountID', 'transactionAmountUSD',
       'transactionAmount', 'transactionCurrencyCode',
       'accountAge', 'validationid', 'LABEL'],
      dtype='object')

# Get only the Numeric Variables (Columns with numeric values in it)
df._get_numeric_data().columns

Index(['transactionAmountUSD', 'transactionAmount', 'accountAge', 'LABEL'],
      dtype='object')

Now inorder to get remaining column names I am subtracting both the above commands

string_cols = list(set(list(df.columns))-set(df._get_numeric_data().columns))

Ok everything goes well until I hit this.

I have found out that Label column though it has numeric values it should not be present in the list of numeric variables. It should be excluded.

(i.e) I want to exclude a particular column name (not using its index in the list but using its name explicitly)

I tried similar statements like the following ones but in vain. Any inputs on this will be helpful

set(df._get_numeric_data().columns-set(df.LABEL)
set(df._get_numeric_data().columns-set(df.LABEL.column)
set(df._get_numeric_data().columns-set(df['LABEL'])

I am sure I am missing a very basic thing but not able to figure it out.

JKC
  • 2,498
  • 6
  • 30
  • 56

3 Answers3

1

Pandas' index supports set operations, so to exclude one column from column index you can just write something like

import pandas as pd
df = pd.DataFrame(columns=list('abcdef'))
print(df.columns.difference({'b'}))

which will return to you

Index(['a', 'c', 'd', 'e', 'f'], dtype='object')

I hope this is what you want :)

David Dale
  • 10,958
  • 44
  • 73
1

Considering LABEL column as your output and the other features as your input, you can try this:

feature_names = [x for x in df._get_numeric_data().columns if x not in ['LABEL']]
input = df[feature_names]
output= df['LABEL']

Hope this helps.

janu777
  • 1,940
  • 11
  • 26
1

First of all, you can exclude all numeric columns much more simply with

pd.DataFrame.select_dtypes(exclude=[np.number])

  transactionID accountID transactionCurrencyCode validationid
0             a         a                       a            a
1             a         a                       a            a
2             a         a                       a            a
3             a         a                       a            a
4             a         a                       a            a

Second of all, there are many ways to drop a column. See this post

df._get_numeric_data().drop('LABEL', 1)

   transactionAmountUSD  transactionAmount  accountAge
0                     1                  1           1
1                     1                  1           1
2                     1                  1           1
3                     1                  1           1
4                     1                  1           1

If you really wanted the columns, use pd.Index.difference

df._get_numeric_data().columns.difference(['LABEL'])

Index(['accountAge', 'transactionAmount', 'transactionAmountUSD'], dtype='object')

Setup

df = pd.DataFrame(
    [['a', 'a', 1, 1, 'a', 1, 'a', 1]] * 5,
    columns=[
        'transactionID', 'accountID', 'transactionAmountUSD',
        'transactionAmount', 'transactionCurrencyCode',
        'accountAge', 'validationid', 'LABEL']
)
piRSquared
  • 285,575
  • 57
  • 475
  • 624