271

I have a DataFrame with column names, and I want to find the one that contains a certain string but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always contiguous).

I want the column name to be returned as a string or a variable, so I can access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?

8 Answers

405

Just iterate over DataFrame.columns; here is an example in which you end up with a list of the column names that match:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)

Output:

['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']

Explanation:

  1. df.columns returns the column names (an Index object you can iterate over)
  2. [col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds col to the resulting list if col contains 'spike'. This syntax is a list comprehension.
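
Since the question also asks for the name back as a string so the column can be accessed later, here is a minimal sketch building on the spike_cols list above (it assumes at least one column matched):

name = spike_cols[0]   # first matching name as a plain string, e.g. 'spike-2'
print(df[name])        # access a single column by the found name
print(df[spike_cols])  # or select all matching columns at once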

If you only want the resulting data set with the columns that match, you can do this:

df2 = df.filter(regex='spike')
print(df2)

Output:

   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
Alvaro Fuentes
  • That's awesome! I don't really understand exactly how it works, though, still being new to both Python and Pandas. Could you perhaps explain? – erikfas Jan 22 '14 at 14:35
  • This is what `DataFrame.filter` does, FYI (and you can supply a regex if you want). – Jeff Jan 22 '14 at 14:37
  • @xndrme how would you do a regex to **exclude** a certain column matching a regex, rather than include it? – Dhruv Ghulati Mar 31 '16 at 11:28
  • @DhruvGhulati You can do this: `spike_cols = [col for col in df.columns if not 'spike' in col]` to get the list of columns that **do not match**, or you can use a negative lookahead regex like `df2 = df.filter(regex='^(?!.*spike).*$')`. Of course, this always depends on the complexity of your excluding pattern. You can read more here: https://docs.python.org/2/howto/regex.html#lookahead-assertions – Alvaro Fuentes Mar 31 '16 at 11:42
  • @DhruvGhulati It is also possible to drop your unwanted columns, as in `df[df.columns.drop(spike_cols)]`; there you get a `DataFrame` without the columns in the list `spike_cols`, which you can obtain using your undesired regex. – Alvaro Fuentes Mar 31 '16 at 11:54
  • More concise code: `df[[col for col in df.columns if "spike" in col]]` – WindChimes May 22 '16 at 03:11
  • What if I have a list of strings like `['spike', 'foo', 'bar']` and want all the columns that contain any of those strings? – JacoSolari Feb 02 '21 at 14:07
  • @JacoSolari `[col for col in df.columns if any(s in col for s in ['spike', 'foo', 'bar'])]` or `df.filter(regex='(spike)|(foo)|(bar)')` – Alvaro Fuentes Feb 03 '21 at 10:30
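
Following up on the exclusion question in the comments above, here is a short sketch of those approaches (assuming the same df and spike_cols as in this answer):

non_spike_cols = [col for col in df.columns if 'spike' not in col]  # names that do NOT contain 'spike'
df_without_spike = df.filter(regex='^(?!.*spike).*$')               # negative-lookahead regex
df_dropped = df[df.columns.drop(spike_cols)]                        # drop the previously found matches

print(non_spike_cols)                     # ['hey spke', 'no']
print(df_without_spike.columns.tolist())  # ['hey spke', 'no']
print(df_dropped.columns.tolist())        # ['hey spke', 'no']
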
141

This answer uses the DataFrame.filter method to do this without a list comprehension:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)

print(df.filter(like='spike').columns)

This will output just 'spike-2'. You can also use a regex, as some people suggested in the comments above:

print(df.filter(regex='spike|spke').columns)

This will output both columns: ['spike-2', 'hey spke']
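
If you need the match back as a single plain string (as the question asks), here is a small sketch on top of DataFrame.filter (it assumes at least one column matches):

matches = df.filter(like='spike').columns  # Index of the matching names
name = matches[0]                          # 'spike-2'
print(df[name])                            # access the column by the found name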

Ben
  • I have many columns and I used this code; it seems to skip some of the names! In this example, imagine running this code and not getting the 'hey spke' column back!! – PM0087 May 24 '21 at 15:25
  • How about excluding some columns by name? How would we go about doing that? – MrSoLoDoLo Oct 03 '21 at 15:04
  • You could do a negative lookahead (regex='^(?!spke)') or get a boolean vector for columns doing something like df.columns.str.contains('spke'). – Ben Mar 21 '22 at 13:06
47

You can also use df.columns[df.columns.str.contains(pat = 'spike')]

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

colNames = df.columns[df.columns.str.contains(pat='spike')]

print(colNames)

This will output the column names: 'spike-2', 'spiked-in'

More about pandas.Series.str.contains.
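
As a side note, str.contains also accepts case and regex arguments, so here is a small sketch of a case-insensitive, literal (non-regex) match on the same df:

colNames_ci = df.columns[df.columns.str.contains('SPIKE', case=False, regex=False)]
print(colNames_ci)  # Index(['spike-2', 'spiked-in'], dtype='object')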

vasili111
39
# select columns containing 'spike'
df.filter(like='spike', axis=1)

You can also select columns by exact name or by regular expression; refer to pandas.DataFrame.filter.
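
For reference, a quick sketch of the three selection modes that DataFrame.filter supports (using a df like the ones above):

import pandas as pd

df = pd.DataFrame({'spike-2': [1, 2], 'hey spke': [3, 4], 'spiked-in': [5, 6], 'no': [7, 8]})

print(df.filter(items=['spike-2', 'no']))  # exact column names
print(df.filter(like='spike', axis=1))     # substring match on the names
print(df.filter(regex='^spike', axis=1))   # regular-expression match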

Manny
17
df.loc[:, df.columns.str.contains("spike")]
DhanushNayak
  • While it does not exactly answer the original question, I really like this solution as it directly returns the sliced DataFrame (which is probably also what the OP is after). – malvoisen Jan 11 '21 at 17:35
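
If you still want the matching names themselves from this approach, here is a small sketch (assuming the same df as in the earlier answers):

mask = df.columns.str.contains("spike")  # boolean array over the column names
names = df.columns[mask].tolist()        # e.g. ['spike-2', 'spiked-in']
subset = df.loc[:, mask]                 # the sliced DataFrame from the answer above
print(names)
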
9

Another solution that returns a subset of the df with the desired columns:

df[df.columns[df.columns.str.contains("spike|spke")]]

rambutan
4

You can also use this code:

spike_cols = [x for x in df.columns[df.columns.str.contains('spike')]]
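
The outer list comprehension only converts the Index into a plain list; an equivalent and slightly shorter sketch:

spike_cols = df.columns[df.columns.str.contains('spike')].tolist()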
Yury Wallet
1

Getting names and subsetting based on starts-with, contains, and ends-with:

# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html

import pandas as pd

data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)

print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat='spike')].tolist()
print("Contains")
print(colNames_contains)

print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat='^spike')].tolist()
print("Starts")
print(colNames_starts)

print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat='spike$')].tolist()
print("Ends")
print(colNames_ends)

print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike', axis=1)
print("Starts")
print(df_subset_start)

print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike', axis=1)
print("Contains")
print(df_subset_contains)

print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$', axis=1)
print("Ends")
print(df_subset_ends)
vasili111