1

How can one use a logical index (or any other efficient method) to select columns for which the column name contains a certain match to a regular expression.

For example,

raw = '''   id 0_date 0_hr 1_date 1_hr
1 a 21-Jan      30 2-Mar      75
'''

import pandas as pd
from StringIO import StringIO
df = pd.read_table(StringIO(raw),header=0,index_col=[0],sep="\s+")

I would like to create a new dataframe with only the id column and all columns that contain the string "date". I was not able to use str.contains on df.columns. It seems the filter function works, but I wanted to create the logical index if that is a valid method. Thanks.

smci
  • 32,567
  • 20
  • 113
  • 146
julieth
  • 430
  • 4
  • 9
  • Duplicate of [pandas: best way to select all columns whose names start with X](https://stackoverflow.com/questions/27275236/pandas-best-way-to-select-all-columns-starting-with-x/27275479#27275479), [Create a subset of a DataFrame depending on column name](https://stackoverflow.com/questions/20903071/create-a-subset-of-a-dataframe-dependant-on-column-name/20903553#20903553) and many others ... – smci Aug 17 '18 at 23:16

3 Answers3

3

You could use '^id$|date' regex expression with df.filter()

In [28]: df.filter(regex='^id$|date')
Out[28]: 
  id  0_date 1_date
1  a  21-Jan  2-Mar
Zero
  • 74,117
  • 18
  • 147
  • 154
1

Is it what you want to do ?

selected_columns = ['id'] + [e for e in df.columns if 'date' in e]
new_df = df[selected_columns]
DavidK
  • 2,495
  • 3
  • 23
  • 38
  • It is. Thanks. In R, we could write something such as grepl("date",colnames(df)), which creates a logical index by which we subset. Maybe that same logic does not apply here, although I have seen people do that for selecting rows. – julieth May 25 '14 at 01:07
1

If you convert the column Index object to a series, you can use .str to perform vectorized string operations (like regex searches):

>>> df.columns
Index([u'id', u'0_date', u'0_hr', u'1_date', u'1_hr'], dtype='objec
>>> df.columns.to_series().str
<pandas.core.strings.StringMethods object at 0xa2b56cc>
>>> df.columns.to_series().str.contains("date")
id        False
0_date     True
0_hr      False
1_date     True
1_hr      False
dtype: bool
>>> df.loc[:, df.columns.to_series().str.contains("date")]
   0_date 1_date
1  21-Jan  2-Mar

In this case, I might use endswith:

>>> df.loc[:, df.columns.to_series().str.endswith("date")]
   0_date 1_date
1  21-Jan  2-Mar

(Personally, I think Index objects should grow a .str which is basically .to_series().str, to make this a little cleaner.)

DSM
  • 342,061
  • 65
  • 592
  • 494
  • Great. So this answers my question directly. Just to develop my understanding, is this strategy a good one to use in pandas/python or would you have gone with a different strategy for this task? – julieth May 25 '14 at 01:26
  • @julieth: it's a perfectly cromulent idiom. But don't discount listcomp approaches either-- they're often useful too. – DSM May 25 '14 at 05:19