1

I have this dataframe below:

df=pd.DataFrame({'cnpj':[410000132,410000132,4830624000197,4830624000197,4830624000197],'Nome Pessoa':['EUGENIO LUPORINI NETO','JUAN MATIAS SERAGOPIAN','EUGENIO LUPORINI NETO','SIMONE FANKHAUSER','ALEX SOUZA']})
print(df)

            cnpj             Nome Pessoa
0      410000132   EUGENIO LUPORINI NETO
1      410000132  JUAN MATIAS SERAGOPIAN
2  4830624000197   EUGENIO LUPORINI NETO
3  4830624000197       SIMONE FANKHAUSER
4  4830624000197              ALEX SOUZA

Each cnpj is a company. Each Nome Pessoa is a person. I want to list, for each Nome Pessoa to which other persons appear with the same cnpj as him (preferably with no duplicates). In other words, I will be listing how people are related using cnpj as key, in a way that the df looks like this (or at least close to it):

            cnpj             Nome Pessoa   Relations
0      410000132   EUGENIO LUPORINI NETO   ['JUAN MATIAS SERAGOPIAN','SIMONE FANKHAUSER','ALEX SOUZA']
1      410000132  JUAN MATIAS SERAGOPIAN   ['EUGENIO LUPORINI NETO']
2  4830624000197   EUGENIO LUPORINI NETO   ['JUAN MATIAS SERAGOPIAN','SIMONE FANKHAUSER','ALEX SOUZA']
3  4830624000197       SIMONE FANKHAUSER   ['EUGENIO LUPORINI NETO','ALEX SOUZA']
4  4830624000197              ALEX SOUZA   ['EUGENIO LUPORINI NETO','SIMONE FANKHAUSER']

For instance, df['Relations'][0] = ['JUAN MATIAS SERAGOPIAN','SIMONE FANKHAUSER','ALEX SOUZA'] is like that because JUAN MATIAS SERAGOPIAN appears in the same cnpj as EUGENIO LUPORINI NETO (410000132) and SIMONE FANKHAUSER and ALEX SOUZA appear in the other cnpj together with EUGENIO (4830624000197)

I suppose it might be something in the groupby area however not sure how to achieve it.

aabujamra
  • 4,494
  • 13
  • 51
  • 101

2 Answers2

2

You can use apply with a query inside it and append the results to the DataFrame:

def get_relations(row, df):
  row_cnpj = row['cnpj']
  row_name = row['Nome Pessoa']
  query = df.query('cnpj == @row_cnpj and `Nome Pessoa` != @row_name')
  row['Relations'] = query['Nome Pessoa'].values
  return row

df = df.apply(lambda x: get_relations(x, df), axis=1)
Murgalha
  • 368
  • 3
  • 7
1

The following works:

In[0]:

def add_relations(row):
    current_name = row['Nome Pessoa']
    cnpjs = df[df['Nome Pessoa'] == current_name]['cnpj']
    relations = df['cnpj'].isin(cnpjs)
    output = df[relations]['Nome Pessoa'][df['Nome Pessoa'] != current_name]
    return list(output)

df['Relations'] = df.apply(add_relations, axis=1)
df

Out[0]:
            cnpj             Nome Pessoa  \
0      410000132   EUGENIO LUPORINI NETO   
1      410000132  JUAN MATIAS SERAGOPIAN   
2  4830624000197   EUGENIO LUPORINI NETO   
3  4830624000197       SIMONE FANKHAUSER   
4  4830624000197              ALEX SOUZA   

                                           Relations  
0  [JUAN MATIAS SERAGOPIAN, SIMONE FANKHAUSER, AL...  
1                            [EUGENIO LUPORINI NETO]  
2  [JUAN MATIAS SERAGOPIAN, SIMONE FANKHAUSER, AL...  
3                [EUGENIO LUPORINI NETO, ALEX SOUZA]  
4         [EUGENIO LUPORINI NETO, SIMONE FANKHAUSER]

This uses apply so is not optimal, but may be fine depending on how much data you have.

Update: I tried also making something with groupby, and came up with the following which also works, but also feels not ideal because it uses groupby 2x and has a very nasty list comprehension. I feel there is a nicer answers but it is eluding me....

num_to_name = df.groupby('cnpj')['Nome Pessoa'].apply(list)
name_to_num = df.groupby('Nome Pessoa')['cnpj'].apply(list)

df['Relations'] = df['Nome Pessoa'].map(name_to_num)
df['Relations'] = [[x for x in num_to_name.loc[df.loc[i,'Relations']].values.sum()
                    if x != df.loc[i, 'Nome Pessoa']] for i in df.index]

Thanks to this answer for help with the above.

Tom
  • 8,310
  • 2
  • 16
  • 36