
I'm trying to understand the .filter() method in Pandas. I'm not sure why the code below doesn't work:

# Load data
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Set arbitrary index (is this needed?) and try filtering:
indexed_df = df.copy().set_index('sepal width (cm)')
test = indexed_df.filter(lambda x: x['petal length (cm)'] > 1.4)

I get:

TypeError: 'function' object is not iterable

I appreciate there are simpler ways to do this (e.g. Boolean indexing) but I'm trying to understand for learning purposes why filter fails here when it works for a groupby as shown below:

This works:

 filtered_df = df.groupby('petal width (cm)').filter(lambda x: x['sepal width (cm)'].sum() > 50)
User123456789
  • The documentation you link to lists four arguments: `items`, `like`, `regex` and `axis`. None of them (if you read the documentation) accepts a function/lambda expression. – Willem Van Onsem Jan 17 '18 at 15:44
  • `filter` is for selecting columns based on partial matches and regex matches on the column names. – cs95 Jan 17 '18 at 15:44
  • You should just be using plain ol' boolean indexing. – cs95 Jan 17 '18 at 15:45
  • Thank you Willem (and others). I can happily do via Boolean indexing - the sole reason I asked is that it was an example from a DataCamp course, albeit using `groupby` and then `filter` with a `lambda` function. This part is still unclear to me as it works with a `groupby` - I will edit the question to make this explicit. – User123456789 Jan 17 '18 at 16:05
  • To be clear, this is not an exact duplicate of a Boolean indexing question, it's about why `filter` works with a `groupby` and not without. – User123456789 Jan 17 '18 at 16:21
  • @maw501 `DataFrame.filter` and `groupby.filter` are very different methods. Yes it is unfortunate that they have the same name but that's the only thing in common. You shouldn't compare them. – ayhan Jan 18 '18 at 21:14
  • Goodness. I hadn't realised there was a `groupby.filter` - thanks! Maybe make that the answer? Thank you again. – User123456789 Jan 18 '18 at 21:16
  • NOT A DUPLICATE... Is there a way to filter a DataFrame using a lambda? – Alex R Nov 25 '20 at 04:45
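On the last comment's question about filtering with a lambda: one option that should work is passing a callable to `.loc`, which calls it with the dataframe and expects a boolean mask back. A minimal sketch, assuming a small made-up frame in place of the iris data (the values and the name `long_petals` are illustrative only):

```python
import pandas as pd

# Small made-up frame standing in for the iris data (values are illustrative only).
df = pd.DataFrame({
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.1],
    "petal length (cm)": [1.4, 1.4, 4.7, 5.1],
})
indexed_df = df.set_index("sepal width (cm)")

# .loc accepts a callable: it receives the frame and must return a boolean mask.
long_petals = indexed_df.loc[lambda d: d["petal length (cm)"] > 1.4]
print(long_petals)
```

This is still Boolean indexing underneath; the lambda only defers building the mask, which can be handy in a method chain.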

1 Answer


You can filter the dataframe with the condition indexed_df['petal length (cm)'] > 1.4 (note that here we use indexed_df, not x), so:

indexed_df[indexed_df['petal length (cm)'] > 1.4]

How does this work?

Evaluating indexed_df['petal length (cm)'] returns that column of the dataframe as a Series: a sequence that holds, for every index label, the value of that column. Comparing the column with > 1.4 yields a Series of booleans: True for each row where the condition is met, and False otherwise.

We can then pass this boolean Series to the dataframe, as in indexed_df[boolean_column], to obtain only the rows for which boolean_column is True.
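The two steps can be made explicit by naming the intermediate values. A minimal sketch, assuming a tiny made-up frame (the values and the names `column`, `mask` and `selected` are illustrative only):

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative, not the real iris data.
df = pd.DataFrame({"petal length (cm)": [1.4, 1.3, 4.7, 5.1]})

column = df["petal length (cm)"]  # a Series: one value per index label
mask = column > 1.4               # a boolean Series aligned on the same index
selected = df[mask]               # keeps only the rows where mask is True

print(mask.tolist())
print(selected)
```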

Willem Van Onsem
  • Thanks but as stated above this doesn't clear up why the lambda function works when used with `groupby`, as now included in the edited question. – User123456789 Jan 18 '18 at 21:09
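As the comments on the question point out, `DataFrame.filter` and the groupby `filter` are separate methods that only share a name. A minimal sketch of the difference, assuming a small made-up frame and an arbitrary threshold of 6 (both are illustrative, not taken from the question):

```python
import pandas as pd

# Small made-up frame; values and the threshold below are illustrative only.
df = pd.DataFrame({
    "petal width (cm)": [0.2, 0.2, 1.4, 1.4, 2.0],
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 2.5],
})

# DataFrame.filter selects by label (column names by default), never by value.
by_label = df.filter(like="sepal")     # columns whose name contains "sepal"
by_regex = df.filter(regex=r"^petal")  # columns whose name starts with "petal"

# The groupby filter calls the function once per group; the function returns a
# single True/False, and every row of a group that returns True is kept.
kept = df.groupby("petal width (cm)").filter(
    lambda g: g["sepal width (cm)"].sum() > 6
)

print(by_label.columns.tolist())
print(kept)
```

Passing a lambda to `DataFrame.filter` fails because its first parameter, `items`, expects a list of labels, which is consistent with the "'function' object is not iterable" error in the question.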