3

I'm trying to use the DataFrame.drop_duplicates parameters, but without luck: the duplicates are not being removed.

I want to remove duplicates based on the column "inc_id". If there are duplicates in that column, only the last row should be kept.

My df is:

    inc_id  inc_cr_date
0   1049670 121
1   1049670 55
2   1049667 121
3   1049640 89
4   1049666 12
5   1049666 25

Output should be:

    inc_id  inc_cr_date
0   1049670 55
1   1049667 121
2   1049640 89
3   1049666 25

Code is:

df = df.drop_duplicates(subset='inc_id', keep="last")

Any idea what I am missing here? Thanks.

Gonzalo

3 Answers

4

I think you are just looking to drop the original index:

In [11]: df.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)
Out[11]:
    inc_id  inc_cr_date
0  1049670           55
1  1049667          121
2  1049640           89
3  1049666           25
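For anyone who wants to reproduce this outside an IPython session, here is a minimal self-contained sketch. The frame is rebuilt by hand from the sample in the question, and the name `deduped` is only illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "inc_id": [1049670, 1049670, 1049667, 1049640, 1049666, 1049666],
    "inc_cr_date": [121, 55, 121, 89, 12, 25],
})

# Keep the last row for each inc_id, then renumber the index from 0.
# drop_duplicates returns a new DataFrame, so the result has to be assigned.
deduped = df.drop_duplicates(subset="inc_id", keep="last").reset_index(drop=True)

print(deduped)
```

Without `reset_index(drop=True)` the surviving rows keep their original labels (1, 2, 3, 5), which can look like a wrong result at first glance.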
Andy Hayden
  • Doesn't seem to work, as the df still contains duplicates. Thanks. – Gonzalo Nov 09 '17 at 16:53
  • @Gonzalo this is running code from your example! What is it that your solution does wrong? Can you include the "bad output" from your example in your question? – Andy Hayden Nov 09 '17 at 16:55
  • @Gonzalo, assign it back: `df=df.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)` – BENY Nov 09 '17 at 17:00
  • @AndyHayden My bad, your adjustment works. I was looking at the wrong output (saving to CSVs). Thanks. – Gonzalo Nov 09 '17 at 17:21
1

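For a DataFrame df, rows with a duplicated value in one column (here "text") can be dropped in place using this code.

import pandas as pd

df = pd.read_csv('./data/data-set.csv')
print(df['text'])

def clean_data(dataframe):
    # Drop rows whose 'text' value repeats an earlier row (the first occurrence is kept).
    # inplace=True modifies the caller's DataFrame, so nothing needs to be returned.
    dataframe.drop_duplicates(subset='text', inplace=True)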

clean_data(df)
print(df['text'])
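A small sketch of the difference between the two calling styles, in case it helps. The data here is a made-up stand-in for the CSV (which isn't available), and the names `result` and `returned` are just for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the CSV file used above.
df = pd.DataFrame({"text": ["a", "b", "a", "c", "b"]})

# Without inplace=True the call returns a new frame and leaves df untouched.
result = df.drop_duplicates(subset="text")
print(len(df), len(result))

# With inplace=True the frame itself is modified and the call returns None.
returned = df.drop_duplicates(subset="text", inplace=True)
print(returned, len(df))
```

So either assign the returned frame back or pass `inplace=True`. Calling `df.drop_duplicates(...)` on its own line with neither leaves `df` unchanged.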
Isurie
-1
df.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)
K.Dᴀᴠɪs
marc
  • Hello, and welcome to Stack Overflow. It would be awesome if you could edit your answer and expand a bit on it, and share how your code adds to the previous answer and helps to answer the original question. `f.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)` does appear as part of the accepted answer. – dbc Dec 17 '18 at 22:49