21

I was struggling this afternoon to find a way of selecting few columns of my Pandas DataFrame, by checking the occurrence of a certain pattern in their name (label?).

I had been looking for something like contains or isin for nd.arrays / pd.series, but got no luck.

This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:

hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln= df[hp]

However, no matter how I banged my head, I could not apply .str.contains() to the object returned bydf.columns - which is an Index - nor the one returned by df.columns.values - which is an ndarray. This works fine for what is returned by the "slicing" operation df[column_name], i.e. a Series, though.

My first solution involved a for loop and the creation of a help list:

ll = []
for a in df.columns:
    if a.startswith('start_exp1') | a.startswith('start_exp2'):
    ll.append(a)
df[ll]

(one could apply any of the str functions, of course)

Then, I found the map function and got it to work with the following code:

import re
sel = df.columns.map(lambda x: bool(re.search('your_regex',x))
df[df.columns[sel]]

Of course in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.

I am very new to Python and never really programmed anything so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using a map - could potentially be faster, besides looking more elegant to my untrained eye.

I am curious to know what you think of it, and what possible alternatives would be. Given my level of noobness, I would really appreciate if you could correct any mistakes I could have made in the code and point me in the right direction.

Thanks, Michele

EDIT : I just found the Index method Index.to_series(), which returns - ehm - a Series to which I could apply .str.contains('whatever'). However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function..

galath
  • 5,717
  • 10
  • 29
  • 41
Michele Ancis
  • 1,265
  • 4
  • 16
  • 29

3 Answers3

41

Select column by partial string, can simply be done, via:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string match, you can pass axis=0 to filter:

df.filter(like='hello', axis=0) 
Philipp Schwarz
  • 18,050
  • 5
  • 32
  • 36
23

Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):

In [1]: df
Out[1]: 
   x  y  z
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9

In [2]: df.columns.to_series().str.contains('x')
Out[2]: 
x     True
y    False
z    False
dtype: bool

In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]: 
   x
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

UPDATE I just read your last paragraph. From the documentation, str.contains allows you to pass a regex by default (str.contains('^myregex'))

r_31415
  • 8,752
  • 17
  • 74
  • 121
  • 4
    Just `df.loc[:, df.columns.str.contains('x')]` also works. – uut Nov 14 '19 at 18:13
  • 1
    @Robert Smith The UPDATE was the most useful. I researched it a little bit more, and I discovered two things. Set `case=False` for a case insensitive search and `regex=True` to use regular expressions. Example: `df.loc[:, df.columns.str.contains('x', case=False, regex=True)]` – Greg G Dec 24 '21 at 06:53
1

I think df.keys().tolist() is the thing you're searching for.

A tiny example:

from pandas import DataFrame as df

d = df({'somename': [1,2,3], 'othername': [4,5,6]})

names = d.keys().tolist()

for n in names:
    print n
    print type(n)

Output:

othername
type 'str'

somename
type 'str'

Then with the strings you got, you can do any string operation you want.

Geeocode
  • 5,705
  • 3
  • 20
  • 34