
I have 2 datasets (in CSV format) of different sizes, as follows:

df_old:

index category  text
 0    spam      you win much money
 1    spam      you are the winner of the game
 2    not_spam  the weather in Chicago is nice
 3    not_spam  pizza is an Italian food
 4    neutral   we have a party now
 5    neutral   they are driving to downtown
 

df_new:

index category  text
 0    spam      you win much money
 14   spam      London is the capital of Canada
 15   not_spam  no more raining in winter
 25   not_spam  the soccer game plays on HBO
 4    neutral   we have a party now
 31   neutral   construction will be done
 

I am using code that concatenates df_new to df_old so that, within each category, the df_new rows go on top of the df_old rows.

The code is:

(pd.concat([df_new, df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))

Now, the problem is that some rows with the same index, category, and text (all matching in the same row, like [0, spam, you win much money]) end up duplicated, and I want to avoid this.

The expected output should be:

df_concat:

index category  text
 14   spam      London is the capital of Canada
 0    spam      you win much money
 1    spam      you are the winner of the game
 15   not_spam  no more raining in winter
 25   not_spam  the soccer game plays on HBO
 2    not_spam  the weather in Chicago is nice
 3    not_spam  pizza is an Italian food
 31   neutral   construction will be done
 4    neutral   we have a party now
 5    neutral   they are driving to downtown    

I tried this and this, but those remove either the category or the text.
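In case it helps to reproduce, the two frames can be built like this (treating index as an ordinary column, which is how I have it):

import pandas as pd

df_old = pd.DataFrame({
    'index': [0, 1, 2, 3, 4, 5],
    'category': ['spam', 'spam', 'not_spam', 'not_spam', 'neutral', 'neutral'],
    'text': ['you win much money',
             'you are the winner of the game',
             'the weather in Chicago is nice',
             'pizza is an Italian food',
             'we have a party now',
             'they are driving to downtown'],
})

df_new = pd.DataFrame({
    'index': [0, 14, 15, 25, 4, 31],
    'category': ['spam', 'spam', 'not_spam', 'not_spam', 'neutral', 'neutral'],
    'text': ['you win much money',
             'London is the capital of Canada',
             'no more raining in winter',
             'the soccer game plays on HBO',
             'we have a party now',
             'construction will be done'],
})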

Bilgin

4 Answers


Try concat + drop_duplicates + sort_values:

import pandas as pd

res = pd.concat([df_new, df_old]).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam': 0, 'not_spam': 1, 'neutral': 2}))
print(res)

Output

   index  category                             text
0      0      spam               you win much money
1     14      spam  London is the capital of Canada
1      1      spam   you are the winner of the game
2     15  not_spam        no more raining in winter
3     25  not_spam     the soccer game plays on HBO
2      2  not_spam   the weather in Chicago is nice
3      3  not_spam         pizza is an Italian food
4     31   neutral        construction will be done
4      4   neutral              we have a party now
5      5   neutral     they are driving to downtown
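If your pandas is older than 1.1 (where the key argument of sort_values was added), a sketch of the same idea uses an ordered Categorical instead; a stable sort then keeps rows from df_new ahead of rows from df_old within each category:

import pandas as pd

res = pd.concat([df_new, df_old]).drop_duplicates()
# An ordered categorical makes sort_values respect the spam > not_spam > neutral order
res['category'] = pd.Categorical(res['category'],
                                 categories=['spam', 'not_spam', 'neutral'],
                                 ordered=True)
# mergesort is stable, so the concat order is preserved within each category
res = res.sort_values('category', kind='mergesort')
print(res)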
Dani Mesejo

Your code seems right; add this to the concat result and it will remove your duplicates:

# These first lines create a new 'index' column from the row index and keep the rest of the code correct
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = (pd.concat([df_new, df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))

df_concat = df_concat.drop_duplicates()

If you want to reset the row numbering as well (this of course does not change the 'index' column), you can do:

    df_concat = df_concat.drop_duplicates(ignore_index=True)
adir abargil
  • Thanks for the code. I don't actually want to reindex the index. The problem is that my data contains important duplicates within ```df_old``` or ```df_new```, and the only thing I want to remove during the concatenation is the row that has the same index. I tried to do it with ['index'] but it didn't work. Any suggestion? – Bilgin Dec 14 '20 at 21:31
  • From the data you showed us, it looks like there is a column named 'index', which is a very common name; let me work a little on the example – adir abargil Dec 14 '20 at 21:41

To remove duplicates on specific column(s), use subset in drop_duplicates:

df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
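Applied to the concat from the question, a sketch might look like this; keep='first' keeps the df_new copy of a duplicated row because df_new comes first in the concat:

df_concat = pd.concat([df_new, df_old], sort=False)
df_concat = df_concat.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
df_concat = df_concat.sort_values('category', ascending=False, kind='mergesort')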
Sura-da
  • Thanks for the comment. I am wondering how I can set a rule like: if ```'index', 'category', 'text'``` are all the same at the same time, then remove them? – Bilgin Dec 14 '20 at 21:32

You can always use combine_first:

out = df_new.combine_first(df_old)
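Note that combine_first aligns rows on the DataFrame index, so a sketch for this data would first make the index column the actual index; rows present in both frames then come from df_new, and the category sort still has to be applied afterwards (the within-category order follows the combined index rather than putting df_new first):

out = df_new.set_index('index').combine_first(df_old.set_index('index'))
out = out.reset_index().sort_values('category', ascending=False, kind='mergesort')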
BENY