I want to select several columns from a data frame, based on the column name.
Let's look at the iris dataset, and try to select the sepal
columns.
import pandas as pd
import seaborn as sns
iris = sns.load_dataset("iris")
iris.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
I can use string methods on the columns, then use .loc
, but it's clunky.
iris.loc[:, iris.columns.str.startswith("sepal")]
The same is true of using the more general .match()
method.
iris.loc[:, iris.columns.str.match("^sepal")]
I want something like .startswith()
that returns the matched values, not a Boolean array. Then I could write something like
iris[iris.columns.str.STARTSWITH("sepal")]
So far as I can tell, the only relevant string method that will return values rather than a Boolean vector is .extract()
, but using that makes the code dramatically less readable.
iris[iris.columns.str.extract("^(sepal.*)").dropna()[0]]
This feels like such a common task that there ought to be a cleaner way of doing it.
What am I missing? Is there a clean, idiomatic way of selecting columns from a dataframe based on the column name?