Selecting columns whose names match regex

Question

How can one use a logical index (or any other efficient method) to select columns for which the column name contains a certain match to a regular expression.

For example,

raw = '''   id 0_date 0_hr 1_date 1_hr
1 a 21-Jan      30 2-Mar      75
'''

import pandas as pd
from StringIO import StringIO
df = pd.read_table(StringIO(raw),header=0,index_col=[0],sep="\s+")

I would like to create a new dataframe with only the id column and all columns that contain the string "date". I was not able to use str.contains on df.columns. It seems the filter function works, but I wanted to create the logical index if that is a valid method. Thanks.

Duplicate of [pandas: best way to select all columns whose names start with X](https://stackoverflow.com/questions/27275236/pandas-best-way-to-select-all-columns-starting-with-x/27275479#27275479), [Create a subset of a DataFrame depending on column name](https://stackoverflow.com/questions/20903071/create-a-subset-of-a-dataframe-dependant-on-column-name/20903553#20903553) and many others ... — smci, Aug 17 '18 at 23:16

score 3 · Answer 1 · answered Mar 24 '21 at 09:29

3

You could use '^id$|date' regex expression with df.filter()

In [28]: df.filter(regex='^id$|date')
Out[28]: 
  id  0_date 1_date
1  a  21-Jan  2-Mar

answered Mar 24 '21 at 09:29

Zero

74,117
18
147
154

score 1 · Answer 2 · answered May 25 '14 at 00:49

1

Is it what you want to do ?

selected_columns = ['id'] + [e for e in df.columns if 'date' in e]
new_df = df[selected_columns]

answered May 25 '14 at 00:49

DavidK

2,495
3
23
38

It is. Thanks. In R, we could write something such as grepl("date",colnames(df)), which creates a logical index by which we subset. Maybe that same logic does not apply here, although I have seen people do that for selecting rows. – julieth May 25 '14 at 01:07

score 1 · Accepted Answer · answered May 25 '14 at 01:18

If you convert the column Index object to a series, you can use .str to perform vectorized string operations (like regex searches):

>>> df.columns
Index([u'id', u'0_date', u'0_hr', u'1_date', u'1_hr'], dtype='objec
>>> df.columns.to_series().str
<pandas.core.strings.StringMethods object at 0xa2b56cc>
>>> df.columns.to_series().str.contains("date")
id        False
0_date     True
0_hr      False
1_date     True
1_hr      False
dtype: bool
>>> df.loc[:, df.columns.to_series().str.contains("date")]
   0_date 1_date
1  21-Jan  2-Mar

In this case, I might use endswith:

>>> df.loc[:, df.columns.to_series().str.endswith("date")]
   0_date 1_date
1  21-Jan  2-Mar

(Personally, I think Index objects should grow a .str which is basically .to_series().str, to make this a little cleaner.)

Great. So this answers my question directly. Just to develop my understanding, is this strategy a good one to use in pandas/python or would you have gone with a different strategy for this task? — julieth, May 25 '14 at 01:26
@julieth: it's a perfectly cromulent idiom. But don't discount listcomp approaches either-- they're often useful too. — DSM, May 25 '14 at 05:19

Selecting columns whose names match regex

3 Answers3