503

I would like to create views or dataframes from an existing dataframe based on column selections.

For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the following, but it didn't work:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Try to create a second dataframe df2 from df with all columns except 'B' and D
my_cols = set(df.columns)
my_cols.remove('B').remove('D')

# This returns an error ("unhashable type: set")
df2 = df[my_cols]

What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?

Amelio Vazquez-Reina

9 Answers

706

You can either drop the columns you do not need or select the ones you need:

# Using DataFrame.drop
df.drop(df.columns[[1, 2]], axis=1, inplace=True)

# drop by Name
df1 = df1.drop(['B', 'C'], axis=1)

# Select the ones you want
df1 = df[['A', 'D']]
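Since pandas 0.21, drop also accepts a `columns` keyword, which avoids the `axis=1` argument entirely. A minimal sketch using the question's df:

```python
import numpy as np
import pandas as pd

# Recreate the question's dataframe with columns A, B, C, D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Drop columns by name via the `columns` keyword (no axis needed)
df2 = df.drop(columns=['B', 'D'])
print(df2.columns.tolist())  # ['A', 'C']
```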
Amrita Sawant

  • is there a way of just selecting the `n`'th column? ie: `df1 = my_df[[n]]` where _n_ is the last column of _n_ number of columns. I'm trying negative indexing `[[-1]]` and it isn't working. Help! – 3kstc Mar 08 '18 at 03:26
  • @3kstc you can use `df1.iloc[:, [-1]]`; this will select all rows, last column. – Kocur4d Apr 25 '18 at 10:16
  • what about using `isin` for columns? No need to drop or loop. See my answer below. – MrE Jul 30 '18 at 20:43
  • You can also type `df.drop(columns=['B', 'C'])` – Levi Baguley Aug 23 '21 at 21:13
232

There is an index method called `difference`. It returns the original columns, with the columns passed as arguments removed.

Here, the result is used to remove columns B and D from df:

df2 = df[df.columns.difference(['B', 'D'])]

Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.
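As noted, the default result may be reordered. Since pandas 0.24, `Index.difference` accepts a `sort` argument; a small sketch of the difference in behaviour, using a deliberately unsorted set of column names:

```python
import numpy as np
import pandas as pd

# Columns in non-alphabetical order to make reordering visible
df = pd.DataFrame(np.random.randn(3, 4), columns=list('DCBA'))

# Default behaviour attempts to sort the remaining labels
print(df.columns.difference(['B']).tolist())              # ['A', 'C', 'D']

# sort=False keeps the labels in their original order
print(df.columns.difference(['B'], sort=False).tolist())  # ['D', 'C', 'A']
```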


Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:

# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns

# does not create a copy the dataframe
subset = df.columns.difference(['B', 'D'])

df = df.drop_duplicates(subset=subset)
IanS
  • Nice. Is there an advantage/disadvantage to doing it this way versus `df.drop(['B', 'D'], axis=1)`? – Ben Sep 29 '16 at 21:23
  • For me the advantage is code readability. I find the use of `drop` to select columns counter-intuitive. – IanS Sep 30 '16 at 08:30
  • `difference()` also seems to reorder columns alphabetically by default – slizb Dec 01 '16 at 22:16
  • @slizb good point, based on the [source code](https://github.com/pandas-dev/pandas/blob/093aa8231eae92ff6cf7ef9564d62289b458aaff/pandas/core/common.py#L352) it looks like there is actually no guarantee about the order of the index returned, as the method uses sets. – IanS Dec 02 '16 at 08:46
  • It seems that the drop method is slightly faster (~515 µs vs ~680 µs), at least in some tests on a 15611 rows x 5 columns dataframe of which I wanted to drop 3 columns, in Python 3.6 and pandas 0.20.3. – bli Nov 08 '17 at 17:12
  • Sorting can be prevented by passing `sort=False` to `difference()`. – billiam Jul 24 '20 at 16:15
  • Index has a drop method too: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.drop.html – Levi Baguley May 06 '21 at 23:54
  • You can `df.drop(['B', 'D'], axis=1, inplace=True)` so you don't make an unneeded copy. – PatrickT Jan 07 '22 at 20:12
160

Another option, without dropping or filtering in a loop:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]

# or more simply include columns:
df[['A', 'B']]

# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]

# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically 
df[df.columns.difference(['C', 'D'])]
MrE
89

You don't really need to convert that into a set:

cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
piggybox
23

Also have a look into the built-in DataFrame.filter function.

Minimalistic but greedy approach (sufficient for the given df):

df.filter(regex="[^BD]")

Conservative/lazy approach (exact matches only):

df.filter(regex="^(?!(B|D)$).*$")

Conservative and generic:

exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
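`filter` also supports exact-name and substring matching without writing a regex, via its `items` and `like` parameters. A minimal sketch, again assuming the question's ABCD columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 4), columns=list('ABCD'))

# items: keep exactly these columns, in the order listed
print(df.filter(items=['A', 'C']).columns.tolist())  # ['A', 'C']

# like: keep columns whose name contains the given substring
print(df.filter(like='A').columns.tolist())          # ['A']
```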
Frank
15

You have 4 columns: A, B, C, D.

Here is a better way to select the columns you need for the new dataframe:

df2 = df1[['A', 'D']]

If you wish to use column positions instead, use iloc (note that plain df1[[0, 3]] looks up the labels 0 and 3, which raises a KeyError here):

df2 = df1.iloc[:, [0, 3]]
Kapil Marwaha
  • `pandas` treats lists and tuples differently as indexes, so `df[('A', 'B')] != df[["A", "B"]]` - note that `df["A", "B"] == df[("A", "B")]`. Not sure I'm a fan... though there's a question as to whether the utility justifies the craziness. – Att Righ Jul 02 '21 at 15:00
10

You just need to convert your set to a list:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
tacaswell
  • Thanks! This works perfectly well. Just wondering, are there any features in Pandas that facilitate the specification of columns for this type (or more sophisticated types) of column filtering? – Amelio Vazquez-Reina Feb 18 '13 at 17:17
  • @user273158 Don't know, I am just starting to learn `Pandas` myself. – tacaswell Feb 18 '13 at 17:20
  • Maybe use [drop](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop)? `df.drop(my_cols, axis=1)` will produce a new DataFrame without the dropped columns. All you need is then to assign it to the new DF: `df2 = df.drop(my_cols, axis=1)` – herrfz Feb 18 '13 at 18:33
  • Use [drop](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop) as in this answer to another question: http://stackoverflow.com/a/18145399/115237 – Noel Evans Aug 12 '14 at 11:28
8

Here's how to create a copy of a DataFrame excluding a list of columns:

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)

But be careful! You mention views in your question, suggesting that if you changed df, you'd want df2 to change too. (Like a view would in a database.)

This method doesn't achieve that:

>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
     A         B         C         D
0  999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
          A         C
0  0.251262 -1.980673

Note that this is also true of @piggybox's method. (Although that method is nice and slick and Pythonic. I'm not knocking it!)

For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.

LondonRob
8

In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.

As of pandas 0.20.0, `usecols` accepts callables. This update allows more flexible options for reading columns:

skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)

This pattern is essentially the inverse of the traditional `usecols` idiom: the specified columns are skipped rather than kept.


Given

Data in a file

import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

filename = "foo.csv"
df.to_csv(filename)

Code

skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1

Output

          A         C
0  0.062350  0.076924
1 -0.016872  1.091446
2  0.213050  1.646109
3 -1.196928  1.153497
4 -0.628839 -0.856529
...

Details

A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).

Note that for the OP's situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.

A request was raised for a "skipcols" option in this issue and was addressed in a later issue.

pylang