How to compare two dataframes by a key and create a new one, but just keeping the keys that are not in the first?

Question

In python 3 and pandas I have two dataframes with the same structure:

data_1 = {
    'numero_cnj' : ['0700488-61.2018.8.07.0017', '0003557-92.2008.4.01.3801', '1009486-37.2017.8.26.0053', '5005742-49.2017.4.04.9999', '0700488-61.2018.8.07.0017'],
    'nome_normalizado' : ['MARIA DOS REIS DE OLIVEIRA SILVA', 'MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA', 'SAO PAULO PREVIDENCIA - SPPREV', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_1 = pd.DataFrame(data_1) 


data_2 = {
    'numero_cnj' : ['0700488-61.2018.8.07.0017', '5005742-49.2017.4.04.9999', '1009486-37.2017.8.26.0053', '0700488-61.2018.8.07.0017'],
    'nome_normalizado' : ['MARIA DOS REIS DE OLIVEIRA SILVA', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'SAO PAULO PREVIDENCIA - SPPREV', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_2 = pd.DataFrame(data_2)

The "numero_cnj" column is an identifying key for the same item, but it can be repeated because more than one person/name can refer to that item.

I want to compare the two dataframes by the key "numero_cnj" and create a new dataframe from df_1, but just keeping the rows or keys that are in df_2 but not in df_1 - keep all keys from df_1 that were not found in df_2

For example

df_1
    numero_cnj  nome_normalizado
0   0700488-61.2018.8.07.0017   MARIA DOS REIS DE OLIVEIRA SILVA
1   0003557-92.2008.4.01.3801   MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
2   1009486-37.2017.8.26.0053   SAO PAULO PREVIDENCIA - SPPREV
3   5005742-49.2017.4.04.9999   INSTITUTO NACIONAL DO SEGURO SOCIAL
4   0700488-61.2018.8.07.0017   GERALDO CAVALCANTE DA SILVEIRA

df_2
    numero_cnj  nome_normalizado
0   0700488-61.2018.8.07.0017   MARIA DOS REIS DE OLIVEIRA SILVA
1   5005742-49.2017.4.04.9999   INSTITUTO NACIONAL DO SEGURO SOCIAL
2   1009486-37.2017.8.26.0053   SAO PAULO PREVIDENCIA - SPPREV
3   0700488-61.2018.8.07.0017   GERALDO CAVALCANTE DA SILVEIRA

In this case, the new dataframe would have only the line:

0003557-92.2008.4.01.3801   MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA

Please, does anyone know the best strategy to do this?

I don't have time to verify right now, but this seems like something you could do with a join. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html — pslessard, Jul 15 '20 at 17:52
Do you mean "df_2 but not in df_1" - because the output seems to indicate it's the other way around. — Roy2012, Jul 15 '20 at 17:52
Thanks @Roy2012 - keep all keys from df_1 that were not found in df_2 — Reinaldo Chaves, Jul 15 '20 at 17:56
I take back my last comment, a join probably won't do it, but I found a similar question that might be helpful if @Roy2012 's answer doesn't work. https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe — pslessard, Jul 15 '20 at 17:57

score 1 · Accepted Answer · answered Jul 15 '20 at 17:55

1

If I'm reading your question correctly, you should use join (merge) with how=outer:

merge =  pd.merge(df_1, df_2, on = "numero_cnj", suffixes = ["", "_y"],  how = "outer", indicator=True)
merge[merge._merge == "left_only"][["numero_cnj", "nome_normalizado"]]

The output is:

                  numero_cnj                                  nome_normalizado
4  0003557-92.2008.4.01.3801  MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA

answered Jul 15 '20 at 17:55

Roy2012

11,755
2
22
35

Let me know if this answers your question. – Roy2012 Jul 15 '20 at 18:25
Thank you, everything seems correct. I tested it on a much larger dataframe – Reinaldo Chaves Jul 15 '20 at 18:55

How to compare two dataframes by a key and create a new one, but just keeping the keys that are not in the first?

1 Answers1