1

I am working on this dataset looks very similar like below where,

transaction_id   customer_id   phone           email
             1            19   12345   123@email.com          
             2            19   00001   245@gmail.com
             3         Guest   00001   123@email.com
             4            22   12345   123@email.com
             5            23   78900   678@gmail.com     

The customers under 19, Guest and 22 are actually the same, according to the similar info used in columns phone and email.

As long as, the customer ids for the customer are not unique, my goal is to find similar rows and assign a new unique customer id (to create a new unique customer_id column).

trans_id   cust_id   phone           email  unique_id
       1        19   12345   123@email.com          1        
       2        19   00001   245@gmail.com          1
       3     Guest   00001   123@email.com          1
       4        22   12345   123@email.com          1
       5        23   78900   678@gmail.com          2

The complicated side is, I can groupby email, or I can groupby email and phone. But I couldn’t grasp all rows, for example transaction number 2 is always being assigned as other unique customer id. I tried this code.

 df['unique_id'] = df.groupby(‘phone’).grouper.group_info[0] 

I greatly appreciate your time and help.

Pythonista
  • 79
  • 1
  • 6

1 Answers1

1

This seems like a network problem, which can be solved with the help of networkx. We need to form the network of all cust_ids that are linked by either email or phone.

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Generate group numbers for unique phones and e-mails
df['p_gp'] = df.groupby('phone').ngroup()
df['e_gp'] = df.groupby('email').ngroup()

# This will create all pair-wise combinations customer_ids within the same `gp`
def gen_pairs(df, gp_col):
    nwk = df[['customer_id', gp_col]].merge(df[['customer_id', gp_col]], on=gp_col).drop(columns=gp_col)
    # Removes duplicates, not necessary and slow
    #nwk = nwk.assign(dummy = nwk.apply(frozenset,1)).drop_duplicates('dummy').drop(columns='dummy')
    return nwk

# All pair-wise combinations of either e-mail or phone
dfn = pd.concat([gen_pairs(df, 'p_gp'), gen_pairs(df, 'e_gp')])

# Create the graph
G = nx.from_pandas_edgelist(dfn, source='customer_id_x', target='customer_id_y')

# Visualize which users are linked:
ax,fig = plt.subplots(figsize=(4,4))
nx.draw(G, node_size=30, with_labels=True, font_size=15, edge_color='#1f77b4')
plt.draw()

enter image description here

We can get the separate groups and create a dictionary to map to a unique ID.

l = [list(x.nodes()) for x in nx.connected_component_subgraphs(G)]
#[['19', '22', 'Guest'], ['23']]

d = dict((k, i) for i in range(len(l)) for k in l[i])
#{'19': 0, '22': 0, '23': 1, 'Guest': 0}

# Finally map the customer_id with the dictionary
df['unique_id'] = df.customer_id.map(d)

  transaction_id customer_id  phone          email  p_gp  e_gp  unique_id
0              1          19  12345  123@email.com     1     0          0
1              2          19  00001  245@gmail.com     0     1          0
2              3       Guest  00001  123@email.com     0     0          0
3              4          22  12345  123@email.com     1     0          0
4              5          23  78900  678@gmail.com     2     2          1
ALollz
  • 57,915
  • 7
  • 66
  • 89