
I have 2 datasets (in CSV format) of different sizes, as follows:

df_old:

index category  text
 0    spam      you win much money
 1    spam      you are the winner of the game
 2    not_spam  the weather in Chicago is nice
 3    not_spam  pizza is an Italian food
 4    neutral   we have a party now
 5    neutral   they are driving to downtown
 

df_new:

index category  text
 0    spam      you win much money
 14   spam      London is the capital of Canada
 15   not_spam  no more raining in winter
 25   not_spam  the soccer game plays on HBO
 4    neutral   we have a party now
 31   neutral   construction will be done
 

I am using code that concatenates df_new to df_old so that, within each category, the df_new rows go on top of the df_old rows.

The code is:

(pd.concat([df_new, df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))

Now, the problem is that some rows with the same index, category, and text (all matching in the same row, like [0, spam, you win much money]) end up duplicated, and I want to avoid this.

The expected output should be:

df_concat:

index category  text
 14   spam      London is the capital of Canada
 0    spam      you win much money
 1    spam      you are the winner of the game
 15   not_spam  no more raining in winter
 25   not_spam  the soccer game plays on HBO
 2    not_spam  the weather in Chicago is nice
 3    not_spam  pizza is an Italian food
 31   neutral   construction will be done
 4    neutral   we have a party now
 5    neutral   they are driving to downtown    

I tried this and this, but those remove either the category or the text.
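In case it helps to reproduce, the two frames can be built like this (treating index as an ordinary column, which is how I have it):

import pandas as pd

df_old = pd.DataFrame({
    'index': [0, 1, 2, 3, 4, 5],
    'category': ['spam', 'spam', 'not_spam', 'not_spam', 'neutral', 'neutral'],
    'text': ['you win much money',
             'you are the winner of the game',
             'the weather in Chicago is nice',
             'pizza is an Italian food',
             'we have a party now',
             'they are driving to downtown'],
})

df_new = pd.DataFrame({
    'index': [0, 14, 15, 25, 4, 31],
    'category': ['spam', 'spam', 'not_spam', 'not_spam', 'neutral', 'neutral'],
    'text': ['you win much money',
             'London is the capital of Canada',
             'no more raining in winter',
             'the soccer game plays on HBO',
             'we have a party now',
             'construction will be done'],
})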

Bilgin

4 Answers


Try concat + drop_duplicates + sort_values:

import pandas as pd

res = pd.concat([df_new, df_old]).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam': 0, 'not_spam': 1, 'neutral': 2}))
print(res)

Output

   index  category                             text
0      0      spam               you win much money
1     14      spam  London is the capital of Canada
1      1      spam   you are the winner of the game
2     15  not_spam        no more raining in winter
3     25  not_spam     the soccer game plays on HBO
2      2  not_spam   the weather in Chicago is nice
3      3  not_spam         pizza is an Italian food
4     31   neutral        construction will be done
4      4   neutral              we have a party now
5      5   neutral     they are driving to downtown
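If your pandas is older than 1.1 (where the key argument of sort_values was added), a sketch of the same idea uses an ordered Categorical instead; a stable sort then keeps rows from df_new ahead of rows from df_old within each category:

import pandas as pd

res = pd.concat([df_new, df_old]).drop_duplicates()
# An ordered categorical makes sort_values respect the spam > not_spam > neutral order
res['category'] = pd.Categorical(res['category'],
                                 categories=['spam', 'not_spam', 'neutral'],
                                 ordered=True)
# mergesort is stable, so the concat order is preserved within each category
res = res.sort_values('category', kind='mergesort')
print(res)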
Dani Mesejo

Your code seems right; add this to the concat result and it will remove your duplicates:

# These first lines create a new 'index' column from the row index and keep the rest of the code correct
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = (pd.concat([df_new, df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))

df_concat = df_concat.drop_duplicates()

If you want to reset the row numbering as well (this of course does not change the 'index' column), you can do:

    df_concat = df_concat.drop_duplicates(ignore_index=True)
adir abargil
  • Thanks for the code. I don't actually want to reindex the index. The problem is that my data contains important duplicates within ```df_old``` or ```df_new```, and the only thing I want to remove during the concatenation is the row that has the same index. I tried to do it with ['index'] but it didn't work. Any suggestion? – Bilgin Dec 14 '20 at 21:31
  • From the data you showed us, it looks like there is a column named 'index', which is a very common name; let me work a little on the example – adir abargil Dec 14 '20 at 21:41

To remove duplicates on specific column(s), use subset in drop_duplicates:

df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
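Applied to the concat from the question, a sketch might look like this; keep='first' keeps the df_new copy of a duplicated row because df_new comes first in the concat:

df_concat = pd.concat([df_new, df_old], sort=False)
df_concat = df_concat.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
df_concat = df_concat.sort_values('category', ascending=False, kind='mergesort')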
Sura-da
  • Thanks for the comment. I am wondering how I can set a rule like: if ```'index', 'category', 'text'``` are all the same at the same time, then remove them? – Bilgin Dec 14 '20 at 21:32

You can always use combine_first:

out = df_new.combine_first(df_old)
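Note that combine_first aligns rows on the DataFrame index, so a sketch for this data would first make the index column the actual index; rows present in both frames then come from df_new, and the category sort still has to be applied afterwards (the within-category order follows the combined index rather than putting df_new first):

out = df_new.set_index('index').combine_first(df_old.set_index('index'))
out = out.reset_index().sort_values('category', ascending=False, kind='mergesort')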
BENY