3

I am new to python so sorry if this is too obvious.

I have a dataframe that looks like below:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 10))
df.columns = ['date1', 'date2', 'date3', 'name1', 'col1', 'col2', 'col3', 'name2', 'date4', 'date5']

    date1     date2     date3     name1      col1      col2      col3  \
0 -0.177090  0.417442 -0.930226  0.460750  1.062997  0.534942 -1.082967   
1 -0.942154  0.047837 -0.494979  2.437469 -0.446984  0.709556 -0.135978   
2 -1.544783  0.129307 -0.169556 -0.890697  2.650924  0.976610  0.290226   
3 -0.651220 -0.196342  0.712601  0.641927 -0.009921 -0.038450  0.498087   
4 -0.299145 -1.407747  1.914364  0.554330 -0.196702  2.037057 -0.287942   

    name2     date4     date5  
0 -0.318310  0.358619 -0.243150  
1  1.171024  0.277943 -1.584723  
2 -0.546707 -1.951831  0.678125  
3 -0.510261 -0.018574 -0.212684  
4  1.929841  0.995625 -1.125044  

I'd like to to keep all columns that have, for example, 'date' in their names. That is, I want to keep columns 'date1', 'date2', 'date3', 'date4', 'date5', etc. In some statistical packages I can use * to represent all possible characters and use a command like this:

keep date* 

Is there an equivalent way of doing this in python?

Thanks very much for any help.

joris
  • 133,120
  • 36
  • 247
  • 202
Zhen Sun
  • 817
  • 3
  • 13
  • 20
  • "Dataframe"? You mean in pandas? – Ignacio Vazquez-Abrams Mar 24 '14 at 02:29
  • please show some code and couple of lines of sample data – m.wasowski Mar 24 '14 at 02:30
  • I updated the question with sample data. – Zhen Sun Mar 24 '14 at 04:09
  • possible duplicate of [Selecting columns](http://stackoverflow.com/questions/11285613/selecting-columns) – lvc Mar 24 '14 at 04:22
  • This does go a little bit further than the question I suggested this dupes (which I didn't realise in time to retract my close vote, but I'll now vote to reopen), but you can build off the technique in the accepted answer there to get what you want. It tells you how to select multiple columns using a list of the column names - eg, `df['date1', 'date2']`. You can build that list dynamically to get what you want - `df[[col for col in df.columns if col.startswith('date')]]`. – lvc Mar 24 '14 at 04:32
  • @lvc Thanks! Thant answered my question! – Zhen Sun Mar 24 '14 at 06:31
  • This is a good question, should reopen it. – joris Mar 24 '14 at 12:04
  • 1
    The shorter answer is to use `df.filter(like='date')` for the `keep date* ` equivalent (but off course, the solution of @lvc works also perfectly) – joris Mar 24 '14 at 12:06
  • @joris, thanks! `df.filter` works beautifully. Just one more thing, if I want to drop all `date*` columns, how should I do it? I tried `like != date` but it does not work. If so, @lvc 's solution may be more flexible as I can use `df.drop` – Zhen Sun Mar 26 '14 at 19:47
  • You can use regex expression inside `filter`, so you can use a regex expression that says to *not* match a certain string, eg `df.filter(regex="^(?!date).*$")`. In the upcoming pandas, this functionality will also be provided in `drop`, so this will be easier. – joris Mar 26 '14 at 22:10

1 Answers1

4

You can use the filter method. To do the equivalent of keep date*:

In [62]: df.filter(like='date')
Out[62]: 
      date1     date2     date3     date4     date5
0  0.091744 -0.431606  1.280286  0.263137  0.444550
1  0.688155  0.583918  0.957041  0.446047  1.654274
2  0.109004  0.608818  0.091498  0.940406  0.476479
3 -0.874016  1.312567  0.326480  1.213292  0.504049
4 -0.203515 -0.979086  0.458790  1.012296 -2.446310

The filter method has also a regex keyword, to do some more complex filtering.
Eg to drop all dates, you can provide a regex expression that says to not match a certain string: df.filter(regex="^(?!date).*$")

In the upcoming pandas (0.14), this functionality will also be provided in drop method, so this will be easier.

joris
  • 133,120
  • 36
  • 247
  • 202