0

In python 3 and pandas I have two dataframes with the same structure:

data_1 = {
    'numero_cnj' : ['0700488-61.2018.8.07.0017', '0003557-92.2008.4.01.3801', '1009486-37.2017.8.26.0053', '5005742-49.2017.4.04.9999', '0700488-61.2018.8.07.0017'],
    'nome_normalizado' : ['MARIA DOS REIS DE OLIVEIRA SILVA', 'MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA', 'SAO PAULO PREVIDENCIA - SPPREV', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_1 = pd.DataFrame(data_1) 


data_2 = {
    'numero_cnj' : ['0700488-61.2018.8.07.0017', '5005742-49.2017.4.04.9999', '1009486-37.2017.8.26.0053', '0700488-61.2018.8.07.0017'],
    'nome_normalizado' : ['MARIA DOS REIS DE OLIVEIRA SILVA', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'SAO PAULO PREVIDENCIA - SPPREV', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_2 = pd.DataFrame(data_2) 

The "numero_cnj" column is an identifying key for the same item, but it can be repeated because more than one person/name can refer to that item.

I want to compare the two dataframes by the key "numero_cnj" and create a new dataframe from df_1, but just keeping the rows or keys that are in df_2 but not in df_1 - keep all keys from df_1 that were not found in df_2

For example

df_1
    numero_cnj  nome_normalizado
0   0700488-61.2018.8.07.0017   MARIA DOS REIS DE OLIVEIRA SILVA
1   0003557-92.2008.4.01.3801   MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
2   1009486-37.2017.8.26.0053   SAO PAULO PREVIDENCIA - SPPREV
3   5005742-49.2017.4.04.9999   INSTITUTO NACIONAL DO SEGURO SOCIAL
4   0700488-61.2018.8.07.0017   GERALDO CAVALCANTE DA SILVEIRA

df_2
    numero_cnj  nome_normalizado
0   0700488-61.2018.8.07.0017   MARIA DOS REIS DE OLIVEIRA SILVA
1   5005742-49.2017.4.04.9999   INSTITUTO NACIONAL DO SEGURO SOCIAL
2   1009486-37.2017.8.26.0053   SAO PAULO PREVIDENCIA - SPPREV
3   0700488-61.2018.8.07.0017   GERALDO CAVALCANTE DA SILVEIRA

In this case, the new dataframe would have only the line:

0003557-92.2008.4.01.3801   MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA

Please, does anyone know the best strategy to do this?

Reinaldo Chaves
  • 965
  • 4
  • 16
  • 43
  • 1
    I don't have time to verify right now, but this seems like something you could do with a join. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html – pslessard Jul 15 '20 at 17:52
  • 1
    Do you mean "df_2 but not in df_1" - because the output seems to indicate it's the other way around. – Roy2012 Jul 15 '20 at 17:52
  • Thanks @Roy2012 - keep all keys from df_1 that were not found in df_2 – Reinaldo Chaves Jul 15 '20 at 17:56
  • 1
    Great. See my answer below. – Roy2012 Jul 15 '20 at 17:56
  • 1
    I take back my last comment, a join probably won't do it, but I found a similar question that might be helpful if @Roy2012 's answer doesn't work. https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe – pslessard Jul 15 '20 at 17:57

1 Answers1

1

If I'm reading your question correctly, you should use join (merge) with how=outer:

merge =  pd.merge(df_1, df_2, on = "numero_cnj", suffixes = ["", "_y"],  how = "outer", indicator=True)
merge[merge._merge == "left_only"][["numero_cnj", "nome_normalizado"]]

The output is:

                  numero_cnj                                  nome_normalizado
4  0003557-92.2008.4.01.3801  MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
Roy2012
  • 11,755
  • 2
  • 22
  • 35