Remove duplicates of pandas df

Question

I'm trying to use the `DataFrame.drop_duplicates` parameters, but without luck: the duplicates are not being removed.

I want to remove duplicates based on the column "inc_id". If there are duplicates in that column, only the last row should be kept.

My df is:

    inc_id  inc_cr_date
0  1049670          121
1  1049670           55
2  1049667          121
3  1049640           89
4  1049666           12
5  1049666           25

Output should be:

    inc_id  inc_cr_date
0  1049670           55
1  1049667          121
2  1049640           89
3  1049666           25

Code is:

df = df.drop_duplicates(subset='inc_id', keep="last")

Any idea what I am missing here? Thanks.

It's not actually an error, but the df still contains duplicates. Thanks. — Gonzalo, Nov 09 '17 at 16:51
Possible duplicate: [Drop all duplicate rows in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas) — Herpes Free Engineer, Jul 04 '18 at 17:07

score 4 · Accepted Answer · answered Nov 09 '17 at 16:44

I think you are just looking to drop the original index:

In [11]: df.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)
Out[11]:
    inc_id  inc_cr_date
0  1049670           55
1  1049667          121
2  1049640           89
3  1049666           25
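
As a self-contained sketch of the above, here is the same call run on the sample data from the question. The variable name `deduped` is only illustrative. Note that `drop_duplicates` returns a new DataFrame and leaves `df` untouched, so the result has to be assigned to something before it is used or saved.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "inc_id": [1049670, 1049670, 1049667, 1049640, 1049666, 1049666],
    "inc_cr_date": [121, 55, 121, 89, 12, 25],
})

# Keep only the last row for each inc_id; surviving rows keep
# their original index labels (1, 2, 3, 5) at this point
deduped = df.drop_duplicates(subset="inc_id", keep="last")

# Replace the old labels with a fresh 0..n-1 index
deduped = deduped.reset_index(drop=True)

print(deduped)
print(len(df))  # the original frame still has all 6 rows
```

If the deduplicated frame is what should be written out (for example to a CSV), it is `deduped` that needs to be saved, not `df`.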

answered Nov 09 '17 at 16:44

Andy Hayden

Doesn't seem to work, as the df still has duplicates. Thanks. – Gonzalo Nov 09 '17 at 16:53
@Gonzalo this is running code from your example! What is it your solution does wrong? Can you include the "bad output" from your example in your question? – Andy Hayden Nov 09 '17 at 16:55
@Gonzalo, assign it back: `df=df.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)` – BENY Nov 09 '17 at 17:00
@AndyHayden My bad, it works with your adjustment. I was looking at the wrong output (saving to CSVs...). Thanks. – Gonzalo Nov 09 '17 at 17:21

score 1 · Answer 2 · answered Jan 21 '21 at 09:42

For a DataFrame df, duplicate rows can be dropped in place using this code.

import pandas as pd

df = pd.read_csv('./data/data-set.csv')
print(df['text'])

def clean_data(dataframe):
    # Drop rows whose 'text' value repeats an earlier row,
    # modifying the passed DataFrame in place
    dataframe.drop_duplicates(subset='text', inplace=True)

clean_data(df)
print(df['text'])
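
One caveat when applying this to the original question: `keep` defaults to `"first"`, so an in-place call without `keep="last"` retains the earliest row of each group. The sketch below contrasts the two on a small made-up frame. The helper name `dedupe_last` and the sample values are assumptions for illustration only, not part of the answer above.

```python
import pandas as pd

def dedupe_last(dataframe, column):
    """Return a new DataFrame keeping only the last row per value in `column`."""
    return dataframe.drop_duplicates(subset=column, keep="last").reset_index(drop=True)

# Small made-up frame with one duplicated inc_id
df = pd.DataFrame({"inc_id": [1, 1, 2], "inc_cr_date": [121, 55, 89]})

# In-place variant with the default keep="first": the earliest duplicate survives
inplace_df = df.copy()
inplace_df.drop_duplicates(subset="inc_id", inplace=True)
print(inplace_df["inc_cr_date"].tolist())  # [121, 89]

# Returning variant with keep="last": the latest duplicate survives
last_df = dedupe_last(df, "inc_id")
print(last_df["inc_cr_date"].tolist())  # [55, 89]
```

Returning a new frame instead of mutating the argument also makes it harder to inspect or save the wrong object by accident.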

score -1 · Answer 3 · edited Dec 17 '18 at 22:46

df.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)

edited Dec 17 '18 at 22:46

K.Dᴀᴠɪs

answered Dec 17 '18 at 22:23

marc

Hello, and welcome to Stack Overflow. It would be awesome if you could [edit] your answer to expand on it a bit, explaining how your code adds to the previous answer and helps to answer the original question. `f.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)` already appears as part of the accepted answer. – dbc Dec 17 '18 at 22:49
