172

I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

I want to select the rows that have a value of 1 in any column whose name starts with foo. (including the dot). Is there a better way to do it than:

df2 = df[(df['foo.aa'] == 1)|
(df['foo.fighters'] == 1)|
(df['foo.bars'] == 1)|
(df['foo.fox'] == 1)|
(df['foo.manchu'] == 1)
]

Something similar to:

df2= df[df.STARTS_WITH_FOO == 1]

The answer should print out a DataFrame like this:

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]
Henry Ecker
ccsv

11 Answers

249

Just perform a list comprehension to create your columns:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To mask out the values that don't meet your == 1 criterion, index the frame with the comparison:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      NaN       1       NaN           NaN      NaN        NaN     NaN
1      NaN     NaN       NaN             1      NaN        NaN     NaN
2      NaN     NaN       NaN           NaN        1        NaN     NaN
3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
5      NaN     NaN         1           NaN      NaN        NaN     NaN

EDIT

OK, after seeing what you want, the convoluted answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
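
The same rows can be picked with a single boolean mask, which may be easier to read than the chained indexing above. This is a minimal sketch rather than the original answer's code; the names foo_cols and row_mask are illustrative, and the frame is the one from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

# Column labels that start with the prefix (illustrative name).
foo_cols = [col for col in df.columns if col.startswith('foo')]

# True for each row where at least one foo column equals 1.
row_mask = (df[foo_cols] == 1).any(axis=1)

# Keep every column, but only the matching rows.
df2 = df[row_mask]
print(df2.index.tolist())
```

Reducing the comparison with any(axis=1) gives one boolean per row, so the frame can be indexed directly without dropna or a second lookup through .index.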
EdChum
  • why does a list comprehension over a dataframe (the `col in df` part) loop over the names of the columns in the dataframe? rather than looping over each column (so that col would be a series)? (i ask because in R the equivalent for loop syntax would loop over the vectors that are the columns). (note that `[col for col in df.columns if col.startswith('foo')]` gives the right output too but makes more sense to me) – Richard DiSalvo Jun 07 '23 at 00:00
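
On the iteration question in the comment above: a DataFrame iterates like a dict, yielding its column labels rather than the columns themselves. The sketch below uses a small made-up frame to show that, along with DataFrame.items(), which yields (label, Series) pairs when the column data is what you want.

```python
import pandas as pd

frame = pd.DataFrame({'foo.a': [1, 2], 'bar.b': [3, 4]})  # made-up example data

# Iterating over the frame yields the column labels.
labels = [col for col in frame]

# items() yields (label, Series) pairs instead.
pairs = [(label, type(series).__name__) for label, series in frame.items()]

print(labels)
print(pairs)
```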
105

Now that pandas' indexes support string operations, arguably the simplest and best way to select columns beginning with 'foo' is just:

df.loc[:, df.columns.str.startswith('foo')]

Alternatively, you can filter column (or row) labels with df.filter(). To specify a regular expression to match the names beginning with foo.:

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To select only the required rows (containing a 1) and the columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any:

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0
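
Note that (df == 1) above tests every column, so a 1 in a column such as nas.foo would also select a row. If only the foo. columns should decide, one option is to build the row mask from the filtered frame. A minimal sketch on a small made-up frame (the names small and foo are illustrative):

```python
import pandas as pd

# Made-up data: row 1 has a 1 only in a column that does not start with "foo."
small = pd.DataFrame({'foo.a': [1, 0, 0],
                      'foo.b': [0, 0, 0],
                      'nas.foo': [0, 1, 0]})

# Columns whose label starts with "foo."
foo = small.filter(regex=r'^foo\.', axis=1)

# Mask built from all columns versus only the foo. columns.
all_cols_mask = (small == 1).any(axis=1)
foo_only_mask = (foo == 1).any(axis=1)

# Rows chosen by the foo. columns alone, restricted to those columns.
result = small.loc[foo_only_mask, foo.columns]
print(result)
```

On the question's data both masks happen to select the same rows, which is why the output above is unchanged either way.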
Alex Riley
  • This is the answer I came here for, which matches the question title. What the OP actually wanted was more like "Best way to select rows with a filter based only on columns starting with x". – scign Sep 04 '20 at 01:34
16

The simplest way is to use str directly on the column names; there is no need for pd.Series:

df.loc[:,df.columns.str.startswith("foo")]


8

In my case I needed to match a list of prefixes:

colsToScale=["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]
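
Whether .str.startswith accepts a tuple depends on the pandas version (as far as I know it is supported from pandas 1.5 on; treat that as an assumption and check your installation). A version-independent sketch uses Python's built-in str.startswith, which accepts a tuple of prefixes. The frame dc below is made-up data standing in for the one in this answer:

```python
import pandas as pd

# Made-up frame; only the column labels matter here.
dc = pd.DataFrame({'production.cpu': [1, 2],
                   'test.cpu': [3, 4],
                   'staging.cpu': [5, 6]})

colsToScale = ["production", "test", "development"]

# Python's str.startswith takes a tuple, not a list, of prefixes.
prefixes = tuple(colsToScale)
selected = [col for col in dc.columns if col.startswith(prefixes)]

subset = dc[selected]
print(selected)
```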
Flavio Sousa
7

You can use the filter method with the like parameter. Note that like matches the substring anywhere in the label, so this also selects nas.foo:

df.filter(like='foo')
Mykola Zotko
5

You can use a regex to select the columns starting with "foo":

df.filter(regex='^foo*')

If the string foo may appear anywhere in the column name, then

df.filter(regex='foo*')

would be appropriate.

For the next step, you can use

df[(df.filter(regex='^foo*') == 1).any(axis=1)]

to keep only the rows where at least one of the foo columns equals 1.

Ricky
  • `*` at the end of regexes does not make sense — we are not looking for `foooooooo`. It seems you wanted `^foo.*` instead. In fact, one can simply remove it, as `df.filter` does not require full match (from beginning to end), so `^foo` will work as well. – Ilya V. Schurov Jun 19 '22 at 12:15
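
To make the comment above concrete, here is a small sketch on made-up column names. DataFrame.filter applies the pattern with re.search, so '^foo*' (an f, an o, then zero or more o) also matches a label that merely starts with fo, while '^foo' requires the full prefix.

```python
import pandas as pd

# Made-up labels chosen to expose the difference between the patterns.
frame = pd.DataFrame({'foo.a': [1], 'fo.b': [2], 'nas.foo': [3]})

# '^foo*' also picks up 'fo.b', because the last 'o' is optional.
loose = list(frame.filter(regex='^foo*').columns)

# '^foo' keeps only labels that start with the full prefix.
strict = list(frame.filter(regex='^foo').columns)

# 'foo' without an anchor matches the substring anywhere in the label.
anywhere = list(frame.filter(regex='foo').columns)

print(loose, strict, anywhere)
```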
3

Based on @EdChum's answer, you can try the following solution:

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

This is helpful when not all of the columns you want to select start with foo. It selects every column whose name contains the substring foo, wherever in the name it appears.

In essence, I replaced .startswith() with .contains().

Arturo Sbr
2

Another option for the selection of the desired entries is to use map:

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

which gives you all the columns for rows that contain a 1:

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

The row selection is done by

(df == 1).any(axis=1)

as in @ajcr's answer which gives you:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

meaning that rows 3 and 4 do not contain a 1 and won't be selected.

The selection of the columns is done using Boolean indexing like this:

df.columns.map(lambda x: x.startswith('foo'))

In the example above this returns

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

So, if a column does not start with foo, False is returned and the column is therefore not selected.

If you just want to return all rows that contain a 1 - as your desired output suggests - you can simply do

df.loc[(df == 1).any(axis=1)]

which returns

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
Cleb
1

I do not like that other solutions require us to refer to the DataFrame twice; it might be fine if you have only one frame named df, but this is often not the case (and your actual name might be much longer). Let's abuse pandas indexing capabilities to type less, and make the code more readable. There is nothing stopping us from using something like this:

df.loc[:, columns.startswith('foo')]

This works because the indexer can be any Callable. We can then even assign this pseudo-indexer to a variable and use it for multiple frames:

foo_columns = columns.startswith('foo')
df_1.loc[:, foo_columns]
df_2.loc[:, foo_columns]

We can even make it pretty-print:

> foo_columns
<function __main__.PandasIndexer:columns.str.startswith(pat='foo')()>

And we can use any other method of the str accessor, e.g. columns.contains(r'bar\d', regex=True), all while getting useful signatures:

> columns.contains
<function __main__.PandasIndexer:columns.str.contains(pat, case=True, flags=0, na=None, regex=True)>

All with this short magic code:

from pandas import Series
from inspect import signature, Signature


class PandasIndexer:
    def __init__(self, axis_name, accessor='str'):
        """
        Args:
            - axis_name: `columns` or `index`
            - accessor: e.g. `str`, or `dt`
        """
        self._axis_name = axis_name
        self._accessor = accessor
        self._dummy_series = Series(dtype=object)

    def _create_indexer(self, attribute):
        dummy_accessor = getattr(self._dummy_series, self._accessor)
        dummy_attr = getattr(dummy_accessor, attribute)
        name = f'PandasIndexer:{self._axis_name}.{self._accessor}.{attribute}'

        def indexer_factory(*args, **kwargs):
            def indexer(df):
                axis = getattr(df, self._axis_name)
                accessor = getattr(axis, self._accessor)
                method = getattr(accessor, attribute)
                return method(*args, **kwargs)

            bound_arguments = signature(dummy_attr).bind(*args, **kwargs)
            indexer.__qualname__ = (
                name + str(bound_arguments).replace('<BoundArguments ', '')[:-1]
            )
            indexer.__signature__ = Signature()
            return indexer

        indexer_factory.__name__ = name
        indexer_factory.__qualname__ = name
        indexer_factory.__signature__ = signature(dummy_attr)
        return indexer_factory

    def __getattr__(self, attribute):
        return self._create_indexer(attribute)

    def __dir__(self):
        """Make it work with auto-complete in IPython"""
        return dir(getattr(self._dummy_series, self._accessor))


columns = PandasIndexer('columns')
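
If the pretty-printing and signatures are not needed, a much smaller helper gives a similar reusable selector, because .loc calls any callable it receives with the frame being indexed. This is only a sketch: starts_with is a hypothetical helper name, not part of pandas, and df_1 and df_2 are made-up frames.

```python
import pandas as pd


def starts_with(prefix):
    """Build a callable for .loc that flags columns starting with `prefix`."""
    return lambda frame: frame.columns.str.startswith(prefix)


# Made-up frames with different column sets.
df_1 = pd.DataFrame({'foo.a': [1], 'bar.b': [2]})
df_2 = pd.DataFrame({'baz.c': [3], 'foo.d': [4], 'foo.e': [5]})

# The selector is evaluated separately against each frame it is applied to.
foo_selector = starts_with('foo')
first = df_1.loc[:, foo_selector]
second = df_2.loc[:, foo_selector]

print(list(first.columns), list(second.columns))
```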
krassowski
  • I actually like to set it as `Column = PandasIndexer('columns')` so that it is obvious that I am playing with magic behaviour rather than using a global variable, as in `df.loc[:, Column.startswith('foo')]`; this also makes it reminiscent of SQLAlchemy (and intuitive to those who have used such an ORM). – krassowski May 10 '21 at 13:13
1

You can also try this for multiple prefixes:

temp = df.loc[:, df.columns.str.startswith(('prefix1','prefix2','prefix3'))]
0

My solution, which may be slower:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
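
One caveat with the concat approach: a row that has a 1 in more than one foo column is appended once per matching column. The question's data has no such row, so the output above is unaffected. A sketch on made-up data showing the effect and one way to drop the repeats by index label:

```python
import pandas as pd

# Made-up data: row 0 has a 1 in both foo columns.
small = pd.DataFrame({'foo.a': [1, 0, 1],
                      'foo.b': [1, 0, 0],
                      'bar': [9, 9, 9]})

foo_cols = [c for c in small.columns if c.startswith('foo')]

# One slice per foo column, stacked; row 0 appears twice.
stacked = pd.concat([small[small[c] == 1] for c in foo_cols]).sort_index()

# Keep only the first occurrence of each index label.
deduped = stacked[~stacked.index.duplicated()]

print(stacked.index.tolist(), deduped.index.tolist())
```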
Robbie Liu