1

I want to select several columns from a data frame, based on the column name.

Let's look at the iris dataset, and try to select the sepal columns.

import pandas as pd
import seaborn as sns
iris = sns.load_dataset("iris")
iris.head()

    sepal_length sepal_width petal_length petal_width species
0   5.1          3.5         1.4          0.2         setosa
1   4.9          3.0         1.4          0.2         setosa
2   4.7          3.2         1.3          0.2         setosa
3   4.6          3.1         1.5          0.2         setosa
4   5.0          3.6         1.4          0.2         setosa

I can use string methods on the columns, then use .loc, but it's clunky.

iris.loc[:, iris.columns.str.startswith("sepal")]

The same is true of using the more general .match() method.

iris.loc[:, iris.columns.str.match("^sepal")]

I want something like .startswith() that returns the matched values, not a Boolean array. Then I could write something like

iris[iris.columns.str.STARTSWITH("sepal")]

So far as I can tell, the only relevant string method that will return values rather than a Boolean vector is .extract(), but using that makes the code dramatically less readable.

iris[iris.columns.str.extract("^(sepal.*)").dropna()[0]]

This feels like such a common task that there ought to be a cleaner way of doing it.

What am I missing? Is there a clean, idiomatic way of selecting columns from a dataframe based on the column name?

Richie Cotton
  • 118,240
  • 47
  • 247
  • 360

0 Answers0