
I am currently working on several data frames that I need to merge. One of my data frames has many duplicates in the merge key, so I used drop_duplicates to remove them. I then checked the shape of my data frame before (it had 531 rows) and after (167 rows), so I assumed it had worked!
But when I call value_counts() on the merge key, it doesn't return 1 for each value of the key. How can I explain this, and how can I correct it?

For clarity, here is my code:

df_stores.drop_duplicates(subset = 'Store ID', keep = 'first' )

df_stores['Store ID'].value_counts().sort_index(ascending=True)
Giorgi Gvimradze
  • 4
    Could you share your code please? We have to blind guess if you don't. – Celius Stingher Feb 26 '20 at 19:29
  • 2
    Welcome to stack overflow! We ask that you provide a [mcve] for your issue, including sample input, sample output, and code for what you've tried so far – G. Anderson Feb 26 '20 at 19:29
  • By default, drop_duplicates only removes rows in which all the columns are duplicated. For the cases where the count is > 1, check whether any data point differs. I am guessing!!! – Vikika Feb 26 '20 at 19:31
  • I agree with @Vikika, you are probably missing the `subset` parameter, but I won't provide an answer until I can see the code. – Celius Stingher Feb 26 '20 at 19:32
  • Sorry, here is my code: `df_stores.drop_duplicates(subset = 'Store ID', keep = 'first' )` followed by `df_stores['Store ID'].value_counts().sort_index(ascending=True)` – Grison Mayliss Feb 26 '20 at 19:43
  • Did you take a look at this question? https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas Also, try to provide a sample dataset in these cases. Possible duplicate question. – Rahul Khanna Feb 26 '20 at 19:44
  • And this is what value_counts returns: 1400 4 1401 4 1402 4 1403 4 1404 4 .. 75001 4 75002 4 75003 4 75016 1 75098 2 Name: Store ID, Length: 167, dtype: int64 – Grison Mayliss Feb 26 '20 at 19:44
  • @RahulKhanna Thank you! After reading it: indeed, I need to keep one row of each set of duplicates (the first one). I don't understand the inplace parameter. – Grison Mayliss Feb 26 '20 at 19:49
  • The inplace parameter is used when you want to persist your changes to the original dataframe. When you use this parameter, the original dataframe is overwritten by the result of the operation you performed. – Rahul Khanna Feb 26 '20 at 19:56
  • @RahulKhanna I managed to fix my problem by adding the parameter inplace = True. I understand inplace and its importance now, thanks to your answer! Many thanks :) – Grison Mayliss Feb 26 '20 at 19:58

1 Answer

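Just so it is easily accessible for others, I am writing this up as an answer. drop_duplicates returns a new dataframe and leaves the original unchanged unless you tell it otherwise, so there are two ways to keep the result: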

1. df_stores.drop_duplicates(subset = 'Store ID', keep = 'first', inplace= True)

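Note: Do not use inplace = True everywhere, as it can raise a warning in some cases (for example, SettingWithCopyWarning when df_stores is itself a slice of another dataframe).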

2. df_stores = df_stores.drop_duplicates(subset = 'Store ID', keep = 'first')
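
To see the difference between the original call and the second form, here is a minimal sketch. The sample data (the store IDs and the 'City' column) is made up for illustration and is not the asker's dataset; `df_unique`, `rows_after_discarded_call` and `max_count` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df_stores = pd.DataFrame({
    "Store ID": [1400, 1400, 1401, 1401, 75016],
    "City": ["Paris", "Paris", "Lyon", "Lille", "Nice"],
})

# As in the question: the deduplicated frame is returned but never stored,
# so df_stores itself still contains every row.
df_stores.drop_duplicates(subset="Store ID", keep="first")
rows_after_discarded_call = len(df_stores)

# Second form from the answer: keep the returned frame by assigning it.
df_unique = df_stores.drop_duplicates(subset="Store ID", keep="first")

# Each 'Store ID' should now appear exactly once.
max_count = df_unique["Store ID"].value_counts().max()

print(rows_after_discarded_call, len(df_unique), max_count)
```

Note that the rows for store 1401 differ in 'City', so they only count as duplicates because `subset="Store ID"` restricts the comparison to that column.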

Rahul Khanna