503

I would like to create views or dataframes from an existing dataframe based on column selections.

For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the following, but it didn't work:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Try to create a second dataframe df2 from df with all columns except 'B' and D
my_cols = set(df.columns)
my_cols.remove('B').remove('D')

# This returns an error ("unhashable type: set")
df2 = df[my_cols]

What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?

Amelio Vazquez-Reina

9 Answers

706

You can either drop the columns you do not need or select the ones you need:

# Using DataFrame.drop
df.drop(df.columns[[1, 2]], axis=1, inplace=True)

# drop by Name
df1 = df1.drop(['B', 'C'], axis=1)

# Select the ones you want
df1 = df[['A', 'D']]
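Since pandas 0.21, drop also accepts a `columns` keyword, which avoids the `axis=1` argument entirely. A minimal sketch using the question's df:

```python
import numpy as np
import pandas as pd

# Recreate the question's dataframe with columns A, B, C, D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Drop columns by name via the `columns` keyword (no axis needed)
df2 = df.drop(columns=['B', 'D'])
print(df2.columns.tolist())  # ['A', 'C']
```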
Amrita Sawant

  • is there a way of just selecting the `n`'th column? ie: `df1 = my_df[[n]]` where _n_ is the last column of _n_ number of columns. I'm trying negative indexing `[[-1]]` and it isn't working. Help! – 3kstc Mar 08 '18 at 03:26
  • @3kstc you can use `df1.iloc[:, [-1]]`; this will select all rows, last column. – Kocur4d Apr 25 '18 at 10:16
  • what about using `isin` for columns? No need to drop or loop. See my answer below. – MrE Jul 30 '18 at 20:43
  • You can also type `df.drop(columns=['B', 'C'])` – Levi Baguley Aug 23 '21 at 21:13
232

There is an index method called `difference`. It returns the original columns, with the columns passed as arguments removed.

Here, the result is used to remove columns B and D from df:

df2 = df[df.columns.difference(['B', 'D'])]

Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.
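As noted, the default result may be reordered. Since pandas 0.24, `Index.difference` accepts a `sort` argument; a small sketch of the difference in behaviour, using a deliberately unsorted set of column names:

```python
import numpy as np
import pandas as pd

# Columns in non-alphabetical order to make reordering visible
df = pd.DataFrame(np.random.randn(3, 4), columns=list('DCBA'))

# Default behaviour attempts to sort the remaining labels
print(df.columns.difference(['B']).tolist())              # ['A', 'C', 'D']

# sort=False keeps the labels in their original order
print(df.columns.difference(['B'], sort=False).tolist())  # ['D', 'C', 'A']
```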


Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:

# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns

# does not create a copy the dataframe
subset = df.columns.difference(['B', 'D'])

df = df.drop_duplicates(subset=subset)
IanS
  • Nice. Is there an advantage/disadvantage to doing it this way versus `df.drop(['B', 'D'], axis=1)`? – Ben Sep 29 '16 at 21:23
  • For me the advantage is code readability. I find the use of `drop` to select columns counter-intuitive. – IanS Sep 30 '16 at 08:30
  • `difference()` also seems to reorder columns alphabetically by default – slizb Dec 01 '16 at 22:16
  • @slizb good point, based on the [source code](https://github.com/pandas-dev/pandas/blob/093aa8231eae92ff6cf7ef9564d62289b458aaff/pandas/core/common.py#L352) it looks like there is actually no guarantee about the order of the index returned, as the method uses sets. – IanS Dec 02 '16 at 08:46
  • It seems that the drop method is slightly faster (~515 µs vs ~680 µs), at least in some tests on a 15611 rows x 5 columns dataframe of which I wanted to drop 3 columns, in Python 3.6 and pandas 0.20.3. – bli Nov 08 '17 at 17:12
  • Sorting can be prevented by passing `sort=False` to `difference()`. – billiam Jul 24 '20 at 16:15
  • Index has a drop method too: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.drop.html – Levi Baguley May 06 '21 at 23:54
  • You can `df.drop(['B', 'D'], axis=1, inplace=True)` so you don't make an unneeded copy. – PatrickT Jan 07 '22 at 20:12
160

Another option, without dropping or filtering in a loop:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]

# or more simply include columns:
df[['A', 'B']]

# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]

# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically 
df[df.columns.difference(['C', 'D'])]
MrE
89

You don't really need to convert that into a set:

cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
piggybox
23

Also have a look into the built-in DataFrame.filter function.

Minimalistic but greedy approach (sufficient for the given df):

df.filter(regex="[^BD]")

Conservative/lazy approach (exact matches only):

df.filter(regex="^(?!(B|D)$).*$")

Conservative and generic:

exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
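`filter` also supports exact-name and substring matching without writing a regex, via its `items` and `like` parameters. A minimal sketch, again assuming the question's ABCD columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 4), columns=list('ABCD'))

# items: keep exactly these columns, in the order listed
print(df.filter(items=['A', 'C']).columns.tolist())  # ['A', 'C']

# like: keep columns whose name contains the given substring
print(df.filter(like='A').columns.tolist())          # ['A']
```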
Frank
15

You have 4 columns: A, B, C, D.

Here is a better way to select the columns you need for the new dataframe:

df2 = df1[['A', 'D']]

If you wish to use column positions instead, use iloc (note that plain df1[[0, 3]] looks up the labels 0 and 3, which raises a KeyError here):

df2 = df1.iloc[:, [0, 3]]
Kapil Marwaha
  • `pandas` treats lists and tuples differently as indexes, so `df[('A', 'B')] != df[["A", "B"]]` - note that `df["A", "B"] == df[("A", "B")]`. Not sure I'm a fan... though there's a question as to whether the utility justifies the craziness. – Att Righ Jul 02 '21 at 15:00
10

You just need to convert your set to a list:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
tacaswell
  • Thanks! This works perfectly well. Just wondering, are there any features in Pandas that facilitate the specification of columns for this type (or more sophisticated types) of column filtering? – Amelio Vazquez-Reina Feb 18 '13 at 17:17
  • @user273158 Don't know, I am just starting to learn `Pandas` myself. – tacaswell Feb 18 '13 at 17:20
  • Maybe use [drop](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop)? `df.drop(my_cols, axis=1)` will produce a new DataFrame without the dropped columns. All you need is then to assign it to the new DF: `df2 = df.drop(my_cols, axis=1)` – herrfz Feb 18 '13 at 18:33
  • Use [drop](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop) as in this answer to another question: http://stackoverflow.com/a/18145399/115237 – Noel Evans Aug 12 '14 at 11:28
8

Here's how to create a copy of a DataFrame excluding a list of columns:

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)

But be careful! You mention views in your question, suggesting that if you changed df, you'd want df2 to change too. (Like a view would in a database.)

This method doesn't achieve that:

>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
     A         B         C         D
0  999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
          A         C
0  0.251262 -1.980673

Note that this is also true of @piggybox's method. (Although that method is nice and slick and Pythonic. I'm not knocking it!)

For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.

LondonRob
8

In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.

As of pandas 0.20.0, `usecols` accepts callables. This update allows more flexible options for reading columns:

skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)

This pattern is essentially the inverse of the traditional `usecols` idiom: the specified columns are skipped rather than kept.


Given

Data in a file

import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

filename = "foo.csv"
df.to_csv(filename)

Code

skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1

Output

          A         C
0  0.062350  0.076924
1 -0.016872  1.091446
2  0.213050  1.646109
3 -1.196928  1.153497
4 -0.628839 -0.856529
...

Details

A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).

Note that for the OP's situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.

A request was raised for a "skipcols" option in this issue and was addressed in a later issue.

pylang