Select two sets of columns by column names in Pandas

Question

Take the DataFrame in the answer of Loc vs. iloc vs. ix vs. at vs. iat? for example.

df = pd.DataFrame(
{'age':[30, 2, 12, 4, 32, 33, 69],
 'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
 'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
 'height':[165, 70, 120, 80, 180, 172, 150],
 'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
 'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},
 index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']
)

Now I want all columns except 'food' and 'height'.

I thought something like df.loc[:,['age':'color', 'score':'state']] would work, but Python returns SyntaxError: invalid syntax.

I am aware of that there is one way to work around: df.drop(columns = ['food', 'height']). However, in my real life situation, I have hundreds of columns to be dropped. Typing out all column names is so inefficient.

I am expecting something similar with dplyr::select(df, -(food:height)) or dplyr::select(df, age:color, score:state) in R language.

Also have read Selecting/Excluding sets of columns in Pandas.

Can you please explain why https://stackoverflow.com/questions/14940743/selecting-excluding-sets-of-columns-in-pandas did not work for you? It seems to have the solution to your problem. — cs95, Jan 20 '18 at 12:30
@cᴏʟᴅsᴘᴇᴇᴅ, Sorry but I don't see how it would work for me. The accepted answer of that thread gives answer like `df.drop(columns = ['food', 'height'])` like I mentioned in my post. Also all the answers there seem to need all the column names to be explicitly typed. Yet I have written in my post clearly that the column names I want to drop are too many to type. — ytu, Jan 20 '18 at 12:38

cs95 · Accepted Answer · 2018-01-20T13:01:16.333

1

First, find all columns lying between food and height (inclusive).

c = df.iloc[-1:0].loc[:, 'food':'height'].columns

Next, filter with difference/isin/setdiff1d -

df[df.columns.difference(c)]

Or,

df.loc[:, ~df.columns.isin(c)]

Or,

df[np.setdiff1d(df.columns, c)]

           age  color  score state
Jane        30   blue    4.6    NY
Nick         2  green    8.3    TX
Aaron       12    red    9.0    FL
Penelope     4  white    3.3    AL
Dean        32   gray    1.8    AK
Christina   33  black    9.5    TX
Cornelia    69    red    2.2    TX

edited Jan 20 '18 at 13:01

answered Jan 20 '18 at 12:19

cs95

379,657
97
704
746

@jezrael Maybe... I didn't look this time. If it is, I'll delete, no prob – cs95 Jan 20 '18 at 12:29
last link in question of OP. – jezrael Jan 20 '18 at 12:29
Thanks but I really need a solution which is more general and *does not* need to type all the column names explicitly. Also, the `~`makes me even confused. Is that something like `not` and `!`? Why is Python so inconsistent in "not" expression? – ytu Jan 20 '18 at 12:44
@ytu Also, the ~ is how you perform negation on dataframe and series. It is an overloaded version of the bitwise NOT operator, but semantically means something else. – cs95 Jan 20 '18 at 12:53
@cᴏʟᴅsᴘᴇᴇᴅ I really appreciate your help. Now `df[df.columns.difference(c)]` and `df[np.setdiff1d(df.columns, c)]` works for me, but `df[~df.columns.isin(c)]` gives me `ValueError: Item wrong length 6 instead of 7.` Would you mind checking? – ytu Jan 20 '18 at 13:00
@ytu Oh, my bad. Fixed. – cs95 Jan 20 '18 at 13:03

jezrael · Answer 2 · 2018-01-20T12:54:11.890

First get positions of columns names by Index.get_loc and then use numpy.r_ for join all slicers together:

a = np.r_[df.columns.get_loc('age'):df.columns.get_loc('color')+1, 
          df.columns.get_loc('score'):df.columns.get_loc('state')+1]

df = df.iloc[:, a]
print (df)
           age  color  score state
Jane        30   blue    4.6    NY
Nick         2  green    8.3    TX
Aaron       12    red    9.0    FL
Penelope     4  white    3.3    AL
Dean        32   gray    1.8    AK
Christina   33  black    9.5    TX
Cornelia    69    red    2.2    TX

score 0 · Answer 3 · answered Apr 23 '22 at 13:36

One option for flexible column selection is with select_columns from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

df.select_columns(slice('age', 'color'), slice('score', 'state'))

           age  color  score state
Jane        30   blue    4.6    NY
Nick         2  green    8.3    TX
Aaron       12    red    9.0    FL
Penelope     4  white    3.3    AL
Dean        32   gray    1.8    AK
Christina   33  black    9.5    TX
Cornelia    69    red    2.2    TX

df.select_columns(slice('food', 'height'), invert = True)

           age  color  score state
Jane        30   blue    4.6    NY
Nick         2  green    8.3    TX
Aaron       12    red    9.0    FL
Penelope     4  white    3.3    AL
Dean        32   gray    1.8    AK
Christina   33  black    9.5    TX
Cornelia    69    red    2.2    TX

Select two sets of columns by column names in Pandas

3 Answers3