How to efficiently remove all duplicates in a DataFrame or CSV file in Python?

Question

I have the table below, stored in mytest.csv:

timestamp   val1    val2    user_id  val3  val4    val5    val6
01/01/2011  1   100 3    5     100     3       5
01/02/2013  20  8        6     12      15      3
01/07/2012      19  57   10    9       6       6        
01/11/2014  3100    49  6        12    15      3
21/12/2012          240  30    240     30       
01/12/2013          63                  
01/12/2013  3200    51  63       50

The table above was produced by the following code, in which I tried to remove all duplicates (based on 'timestamp' and 'user_id'), but unfortunately some remained:

import pandas as pd

newnames = ['timestamp', 'val1', 'val2','val3', 'val4','val5', 'val6','user_id']
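# header=0 replaces the file's own header row with newnames
# (header=False raises a TypeError in current pandas versions)
df = pd.read_csv('mytest.csv', names=newnames, header=0, parse_dates=True, dayfirst=True)
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)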
df = df.loc[:,['timestamp', 'user_id', 'val1', 'val2','val3', 'val4','val5', 'val6']]
df_clean = df.drop_duplicates().fillna(0)

I would also like to know how to efficiently remove all duplicates from the data (pre-processing), and whether I should do this before reading it into a DataFrame. For example, the last two rows are considered duplicates, and only the last one, which does not have an empty val1 (val1 = 3200), should remain in the DataFrame.

Thanks in advance for your help.

Possible duplicate of [Drop all duplicate rows in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas) — Herpes Free Engineer, Jul 04 '18 at 17:09

joris · Accepted Answer · 2016-12-05T10:44:53.927

9

If you want to drop duplicates based on specific columns, you can use the subset argument (older pandas versions: cols) in drop_duplicates:

df_clean = df.drop_duplicates(subset=['timestamp', 'user_id'])
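The question also asks that, among duplicates, the row with a non-empty val1 be the one that survives. A minimal sketch of one way to do that, assuming it is acceptable to sort before de-duplicating: the sample frame below is hypothetical (it only mirrors a few rows of the question's table), and the sort-then-drop approach is a suggestion, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the last two rows of the question's table,
# plus one unrelated row.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["01/12/2013", "01/12/2013", "01/01/2011"], dayfirst=True
    ),
    "user_id": [63, 63, 3],
    "val1": [np.nan, 3200, 1],
    "val2": [np.nan, 51, 100],
})

# Push rows with a missing val1 to the end, keep the first row found for
# each (timestamp, user_id) pair, then restore the original row order.
df_clean = (
    df.sort_values("val1", na_position="last")
      .drop_duplicates(subset=["timestamp", "user_id"], keep="first")
      .sort_index()
)
print(df_clean)
```

Sorting on val1 only decides which duplicate wins when one of them has a missing val1; if several duplicates have a value, the smallest one is kept, which may or may not be what is wanted.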

edited Dec 05 '16 at 10:44

answered Apr 04 '14 at 15:31


Is it possible to equally delete rows for which val1 is nan or equal zero? – Space Apr 04 '14 at 15:39

Do you mean something like `df.dropna(subset=['val1'])`? – joris Apr 04 '14 at 15:42
Will that delete the entire row? – Space Apr 04 '14 at 15:46
Yes, this deletes rows where there is a NaN value in the `val1` column. What do you mean with 'equally delete rows for which val1 is nan'? – joris Apr 04 '14 at 15:48
The same as you: 'deletes rows where there is a NaN value in the val1 column'. Thanks a lot. – Space Apr 04 '14 at 16:01

`cols` doesn't work anymore since v0.18, it was replaced by `subset` – Nicomak Dec 01 '16 at 14:33
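The comments above ask about deleting rows where val1 is NaN or zero; `dropna(subset=['val1'])` covers only the NaN case. A small sketch of handling both with a boolean mask, using a hypothetical sample frame and the illustrative name `df_nonzero`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: val1 is present, zero, missing, present.
df = pd.DataFrame({
    "user_id": [3, 6, 63, 10],
    "val1": [1.0, 0.0, np.nan, 19.0],
})

# Keep only rows where val1 is present and non-zero; whole rows are removed.
mask = df["val1"].notna() & (df["val1"] != 0)
df_nonzero = df[mask]
print(df_nonzero)
```

Note that this has to run before any `fillna(0)`, otherwise missing values and genuine zeros can no longer be told apart.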
