
My dataframe is something like this:

             userid           codeassigned         timestamp
15           553938              M1           1499371200000
15390        527638              M2           1599731200000
15389        521638              M2           1399901200000
15388        521638              M3           1439841200000
15387        553938              M4           1499521200000

I have taken a subset of this dataframe (user with latest timestamp) by doing:

df = df.sort_values('timestamp', ascending=False)
mask = df.duplicated('userid')
subset_df = df[~mask]

Now I want all the rows from the main dataframe whose (userid, timestamp) pair appears in subset_df (there can be multiple rows with the same [userid, timestamp] but a different codeassigned). To get them I'm doing:

subset_df[['userid', 'timestamp']].isin(df)

However, I'm getting this error:

ValueError: cannot compute isin with a duplicate axis.

Any idea what I'm doing wrong?

Saurabh Verma

1 Answer


You need merge to do an inner join of the main dataframe with the filtered subset:

subset_df = df.loc[~mask, ['userid', 'timestamp']]

df = subset_df.merge(df)

Or:

df = subset_df[['userid', 'timestamp']].merge(df)
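As a self-contained illustration of the merge approach, here is a minimal sketch. The data is modelled on the question, but the last row (code M5) is a hypothetical addition so that one user has two rows sharing the latest timestamp; the names `latest` and `result` are also just for illustration.

```python
import pandas as pd

# Sample data modelled on the question. The final row (M5) is a hypothetical
# addition so that user 553938 has two rows with the same latest timestamp.
df = pd.DataFrame(
    {
        "userid": [553938, 527638, 521638, 521638, 553938, 553938],
        "codeassigned": ["M1", "M2", "M2", "M3", "M4", "M5"],
        "timestamp": [
            1499371200000, 1599731200000, 1399901200000,
            1439841200000, 1499521200000, 1499521200000,
        ],
    }
)

# One (userid, timestamp) pair per user: the row with the latest timestamp.
latest = (
    df.sort_values("timestamp", ascending=False)
      .drop_duplicates("userid")[["userid", "timestamp"]]
)

# Inner join on the shared columns keeps every row of df whose
# (userid, timestamp) pair appears in `latest`, including ties.
result = latest.merge(df, on=["userid", "timestamp"])
print(result)
```

Spelling out `on=["userid", "timestamp"]` is optional here, since merge defaults to the columns both frames share, but it makes the join keys explicit.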
jezrael
  • Cool ! But can you please put some light on why 'isin' is not working for this case..? – Saurabh Verma Feb 05 '19 at 06:37
  • @SaurabhVerma - yes, the main problem is that [`DataFrame.isin`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html), when given another DataFrame, tests values only at matching index and column labels; here the indexes differ, hence the error. – jezrael Feb 05 '19 at 06:41
  • I had a similar problem and the way I thought of `x.isin(y)` was incorrect. I would expect it to mean "Is x in y?", but in fact you should think about it as "Is y in x?" Maybe this was only unclear to me, but it explains why you can flip the arguments of `DataFrame.isin` and see this particular error come and go. – Todd Vanyo Mar 01 '19 at 15:48
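
The label-matching behaviour of `DataFrame.isin` discussed in these comments can be checked with a small sketch. The toy frames below are hypothetical and not taken from the question. As far as I can tell, the ValueError is raised when the frame passed to `isin` has a non-unique index or non-unique columns, which would also explain why swapping the two frames can make the error come and go.

```python
import pandas as pd

left = pd.DataFrame({"userid": [1, 2]}, index=[0, 1])

# Same values, same labels.
same_labels = pd.DataFrame({"userid": [1, 2]}, index=[0, 1])
# Same values, but different index labels.
shifted = pd.DataFrame({"userid": [1, 2]}, index=[10, 11])
# Same values, but a duplicated index label.
dup_index = pd.DataFrame({"userid": [1, 2]}, index=[0, 0])

# With a DataFrame argument, a cell is True only when the value matches
# at the same index label and the same column label.
matches_same = left.isin(same_labels)
matches_shifted = left.isin(shifted)
print(matches_same)     # all True
print(matches_shifted)  # all False, although the values are identical

# A duplicated axis on the frame passed in raises the error from the question.
error_message = None
try:
    left.isin(dup_index)
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Because of this alignment, `isin` with a DataFrame argument is not a row-wise membership test on (userid, timestamp) pairs, which is why a merge is the better fit here.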