How to select multiple columns based on their names in python?

Question

I am new to python so sorry if this is too obvious.

I have a dataframe that looks like below:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 10))
df.columns = ['date1', 'date2', 'date3', 'name1', 'col1', 'col2', 'col3', 'name2', 'date4', 'date5']

    date1     date2     date3     name1      col1      col2      col3  \
0 -0.177090  0.417442 -0.930226  0.460750  1.062997  0.534942 -1.082967   
1 -0.942154  0.047837 -0.494979  2.437469 -0.446984  0.709556 -0.135978   
2 -1.544783  0.129307 -0.169556 -0.890697  2.650924  0.976610  0.290226   
3 -0.651220 -0.196342  0.712601  0.641927 -0.009921 -0.038450  0.498087   
4 -0.299145 -1.407747  1.914364  0.554330 -0.196702  2.037057 -0.287942   

    name2     date4     date5  
0 -0.318310  0.358619 -0.243150  
1  1.171024  0.277943 -1.584723  
2 -0.546707 -1.951831  0.678125  
3 -0.510261 -0.018574 -0.212684  
4  1.929841  0.995625 -1.125044

I'd like to to keep all columns that have, for example, 'date' in their names. That is, I want to keep columns 'date1', 'date2', 'date3', 'date4', 'date5', etc. In some statistical packages I can use * to represent all possible characters and use a command like this:

keep date*

Is there an equivalent way of doing this in python?

Thanks very much for any help.

possible duplicate of [Selecting columns](http://stackoverflow.com/questions/11285613/selecting-columns) — lvc, Mar 24 '14 at 04:22
This does go a little bit further than the question I suggested this dupes (which I didn't realise in time to retract my close vote, but I'll now vote to reopen), but you can build off the technique in the accepted answer there to get what you want. It tells you how to select multiple columns using a list of the column names - eg, `df['date1', 'date2']`. You can build that list dynamically to get what you want - `df[[col for col in df.columns if col.startswith('date')]]`. — lvc, Mar 24 '14 at 04:32
The shorter answer is to use `df.filter(like='date')` for the `keep date* ` equivalent (but off course, the solution of @lvc works also perfectly) — joris, Mar 24 '14 at 12:06
@joris, thanks! `df.filter` works beautifully. Just one more thing, if I want to drop all `date*` columns, how should I do it? I tried `like != date` but it does not work. If so, @lvc 's solution may be more flexible as I can use `df.drop` — Zhen Sun, Mar 26 '14 at 19:47
You can use regex expression inside `filter`, so you can use a regex expression that says to *not* match a certain string, eg `df.filter(regex="^(?!date).*$")`. In the upcoming pandas, this functionality will also be provided in `drop`, so this will be easier. — joris, Mar 26 '14 at 22:10

score 4 · Accepted Answer · answered Mar 30 '14 at 09:36

You can use the filter method. To do the equivalent of keep date*:

In [62]: df.filter(like='date')
Out[62]: 
      date1     date2     date3     date4     date5
0  0.091744 -0.431606  1.280286  0.263137  0.444550
1  0.688155  0.583918  0.957041  0.446047  1.654274
2  0.109004  0.608818  0.091498  0.940406  0.476479
3 -0.874016  1.312567  0.326480  1.213292  0.504049
4 -0.203515 -0.979086  0.458790  1.012296 -2.446310

The filter method has also a regex keyword, to do some more complex filtering.
Eg to drop all dates, you can provide a regex expression that says to not match a certain string: df.filter(regex="^(?!date).*$")

In the upcoming pandas (0.14), this functionality will also be provided in drop method, so this will be easier.

How to select multiple columns based on their names in python?

1 Answers1