I want to filter a dataset based on whether a certain ID does not appear in a different dataframe.

I'm not attached to my current approach if there's a better one that I'm not familiar with, but my plan is to apply a Boolean function to my dataset, put the results in a new column, and then filter the entire dataset on that True/False result.

My main dataframe is df, and my other dataframe, which contains the IDs, is called ID:

def groups():
    if df['owner_id'] not in ID['owner_id']:
        return True
    return False

This ends up being accepted (no syntax problems), so I then go to apply it to my dataframe, which fails:

df['ID Groups?'] = df.apply (lambda row: groups() ,axis=1)

Result:

TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')

It seems that somewhere the data I'm trying to use (the IDs contain both letters and numbers, so they are strings) is incorrectly formatted.

I have two questions:

  1. Is my proposed method the best way of going about this?
  2. How can I fix the error that I'm seeing?

My apologies if it's something super obvious. I have very limited exposure to Python and coding as a whole, and I wasn't able to find anywhere this type of question had already been addressed.

Upasana Mittal
  • Possible duplicate of [How to implement 'in' and 'not in' for Pandas dataframe](https://stackoverflow.com/questions/19960077/how-to-implement-in-and-not-in-for-pandas-dataframe) – ALollz Aug 09 '18 at 17:33
  • error is because you are trying to insert `data frame` into a `series` object – Upasana Mittal Aug 09 '18 at 17:35
  • `~df['owner_id'].isin(ID['owner_id'].unique())` will give you your boolean Series – ALollz Aug 09 '18 at 17:35

1 Answer

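Expression to keep only the rows in df whose owner_id appears in ID (prefix the condition with `~` to keep the rows that do not appear, which is what the question asks for):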

df = df[df['owner_id'].isin(ID['owner_id'])]

A lambda expression is going to be way slower than this.

`isin` is the pandas way. `not in` is the Python collections way.
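As a minimal sketch of the `isin` approach, assuming made-up toy data (the `value` column and the specific IDs are not from the question), the mask can be stored as a column and used to filter in either direction:

```python
import pandas as pd

# Toy data; the values here are invented for illustration only.
df = pd.DataFrame({"owner_id": ["a1", "b2", "c3", "d4"],
                   "value": [10, 20, 30, 40]})
ID = pd.DataFrame({"owner_id": ["b2", "d4", "d4"]})

# Boolean mask: True where owner_id does NOT appear in ID.
not_in_id = ~df["owner_id"].isin(ID["owner_id"])

# Optional: keep the flag as a column, as the question proposes.
df["ID Groups?"] = not_in_id

matching = df[~not_in_id]      # rows whose owner_id appears in ID
non_matching = df[not_in_id]   # rows whose owner_id does not appear in ID
```

Duplicates in `ID['owner_id']` do not matter here, since `isin` only tests membership.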

The reason you are getting this error is that `df['owner_id'] not in ID['owner_id']` hashes the left-hand side to figure out whether it is present in the right-hand side. `df['owner_id']` is of type Series and is not hashable, as reported. Luckily, it is not needed.
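To illustrate the error, here is a small sketch on invented toy data. It also shows that `in` on a Series tests the index labels rather than the values, and gives a row-wise version of `groups` that does run. The `row` parameter and the `known_ids` set are hypothetical additions, shown only for comparison with `isin`:

```python
import pandas as pd

# Toy data; the IDs are invented for illustration only.
df = pd.DataFrame({"owner_id": ["a1", "b2", "c3"]})
ID = pd.DataFrame({"owner_id": ["b2"]})

# `in` on a Series looks at the index labels, not the values.
label_found = 0 in ID["owner_id"]      # 0 is an index label
value_found = "b2" in ID["owner_id"]   # "b2" is a value, not a label

# Passing a whole Series as the left-hand side raises TypeError,
# because the Series would have to be hashed.
try:
    df["owner_id"] not in ID["owner_id"]
    raised = False
except TypeError:
    raised = True

# Row-wise variant: the function receives one row and checks a plain set.
known_ids = set(ID["owner_id"])

def groups(row):
    return row["owner_id"] not in known_ids

flags = df.apply(groups, axis=1)
```

This row-wise form works, but it calls a Python function once per row, so the vectorized `isin` expression is still the better choice on larger data.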

Marcin