In python 3 and pandas I have two dataframes with the same structure:
data_1 = {
'numero_cnj' : ['0700488-61.2018.8.07.0017', '0003557-92.2008.4.01.3801', '1009486-37.2017.8.26.0053', '5005742-49.2017.4.04.9999', '0700488-61.2018.8.07.0017'],
'nome_normalizado' : ['MARIA DOS REIS DE OLIVEIRA SILVA', 'MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA', 'SAO PAULO PREVIDENCIA - SPPREV', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_1 = pd.DataFrame(data_1)
data_2 = {
'numero_cnj' : ['0700488-61.2018.8.07.0017', '5005742-49.2017.4.04.9999', '1009486-37.2017.8.26.0053', '0700488-61.2018.8.07.0017'],
'nome_normalizado' : ['MARIA DOS REIS DE OLIVEIRA SILVA', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'SAO PAULO PREVIDENCIA - SPPREV', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_2 = pd.DataFrame(data_2)
The "numero_cnj" column is an identifying key for the same item, but it can be repeated because more than one person/name can refer to that item.
I want to compare the two dataframes by the key "numero_cnj" and create a new dataframe from df_1, but just keeping the rows or keys that are in df_2 but not in df_1 - keep all keys from df_1 that were not found in df_2
For example
df_1
numero_cnj nome_normalizado
0 0700488-61.2018.8.07.0017 MARIA DOS REIS DE OLIVEIRA SILVA
1 0003557-92.2008.4.01.3801 MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
2 1009486-37.2017.8.26.0053 SAO PAULO PREVIDENCIA - SPPREV
3 5005742-49.2017.4.04.9999 INSTITUTO NACIONAL DO SEGURO SOCIAL
4 0700488-61.2018.8.07.0017 GERALDO CAVALCANTE DA SILVEIRA
df_2
numero_cnj nome_normalizado
0 0700488-61.2018.8.07.0017 MARIA DOS REIS DE OLIVEIRA SILVA
1 5005742-49.2017.4.04.9999 INSTITUTO NACIONAL DO SEGURO SOCIAL
2 1009486-37.2017.8.26.0053 SAO PAULO PREVIDENCIA - SPPREV
3 0700488-61.2018.8.07.0017 GERALDO CAVALCANTE DA SILVEIRA
In this case, the new dataframe would have only the line:
0003557-92.2008.4.01.3801 MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
Please, does anyone know the best strategy to do this?