pandas How to store rows dropped using `drop_duplicates`?

Question

Note: See EDIT below.

I need to keep a log of all rows dropped from my df, but I'm not sure how to capture them. The log should be a data frame that I can update for each .drop or .drop_duplicatesoperation. Here are 3 examples of the code for which I want to log dropped rows:

df_jobs_by_user = df.drop_duplicates(subset=['owner', 'job_number'], keep='first')
df.drop(df.index[indexes], inplace=True)
df = df.drop(df[df.submission_time.dt.strftime('%Y') != '2018'].index)

I found this solution to a different .drop case that uses pd.isnull to recode a pd.dropna statement and so allows a log to be generated prior to actually dropping the rows:

df.dropna(subset=['col2', 'col3']).equals(df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)])

But in trying to adapt it to pd.drop_duplicates, I find there is no pd.isduplicate parallel to pd.isnull, so this may not be the best way to achieve the results I need.

EDIT

I rewrote my question here to be more precise about the result I want.

I start with a df that has one dupe row:

import pandas as pd
import numpy as np
df = pd.DataFrame([['whatever', 'dupe row', 'x'], ['idx 1', 'uniq row', np.nan], ['sth diff', 'dupe row', 'x']], columns=['col1', 'col2', 'col3'])
print(df)

# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
2  sth diff  dupe row    x

I then implement the solution from jjp:

df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df.append(df.loc[mask])

I print the results:

print(df_keep)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN

df_keep is what I expect and want.

print(df_droplog)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
2  sth diff  dupe row    x
2  sth diff  dupe row    x

df_droplog is not what I want. It includes the rows from index 0 and index 1 which were not dropped and which I therefore do not want in my drop log. It also includes the row from index 2 twice. I want it only once.

What I want:

print(df_droplog)
# Output:
       col1      col2 col3
2  sth diff  dupe row    x

jpp · Answer 1 · 2019-02-01T07:17:30.160

5

There is a parallel: pd.DataFrame.duplicated returns a Boolean series. You can use it as follows:

df_droplog = pd.DataFrame()

mask = df.duplicated(subset=['owner', 'job_number'], keep='first')
df_jobs_by_user = df.loc[~mask]

df_droplog = df_droplog.append(df.loc[mask])

edited Feb 01 '19 at 07:17

answered Jan 31 '19 at 23:14

jpp

159,742
34
281
339

Thanks, but I'm getting funny results. I started w/ a 3-row df, 2 dupes. `df_jobs_by_user` is good, but `df_droplog` has 4 rows, not 1. Expected? **0** import pd, np **1** `df = pd.DataFrame([['a', 'b', np.nan], ['c', 'd', 'a'], ['a', 'b', np.nan]], columns=['col1', 'col2', 'col3'])` **2** `df_droplog = pd.DataFrame()` **3** `mask = df.duplicated(subset=['col2', 'col3'], keep='first')` **4** `df_jobs_by_user = df.loc[~mask]` **5** `df_droplog = df.append(df.loc[mask])` **6** `df_droplog` ``` col1 col2 col3 0 a b NaN 1 c d a 2 a b NaN 2 a b NaN``` – Karl Baker Feb 01 '19 at 00:00
I edited my original question to show the result I want. – Karl Baker Feb 01 '19 at 05:29

score 2 · Accepted Answer · answered Feb 01 '19 at 05:36

2

Since you only want the duplicated rows in df_droplog, just append only those to an empty dataframe. What you were doing was appending them to the original dataframe df. Try this,

df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])

answered Feb 01 '19 at 05:36

Raunaq Jain

917
7
13

Thank you for spotting that and letting me know! Yes, now I see and now it works. Much thanks also for not just telling me "you're updating the wrong df" but also taking the time to post the correct code. Very helpful as I had been thinking the issue was somewhere else. – Karl Baker Feb 01 '19 at 06:08
@jpp The issue was with `df.append`. Check out what he wants. According to that, the solution is `df_droplog.append`. – Raunaq Jain Feb 01 '19 at 07:16
@jpp I didn't think it to be a typo. Honestly. – Raunaq Jain Feb 01 '19 at 07:54
Reposting someone else's answer with a one word fix as a separate post is definitely going to net you points but is not going to make you many friends. – cs95 Feb 04 '19 at 03:31
@coldspeed Absolutely true and I always avoid doing it too. This time, @jpp's explanation was about the function `duplicated` while the solution was wrong. I didn't think deeply and just answered it. if removing my post helps, I am willing to go ahead and do it. – Raunaq Jain Feb 04 '19 at 06:52

pandas How to store rows dropped using `drop_duplicates`?

2 Answers2