We are working on a use case to generate a unique ID (UID)
for the Customers spanning across different systems/data sources. The unique ID will be generated using PII information such as email & phone no.
Problem Statement:
For example a Customer named as John Smith is doing a multiple transaction to purchase an item.
1st Transaction:
In the first transaction he has filled email as john.smith@gmail.com and phone no. as 1234567890
Email Id= john.smith@gmail.com & phone no. = 1234567890
2nd Transaction:
In the second transaction he has changed the email as john.s@gmail.com and phone no. is same i.e. 1234567890
Email Id= john.s@gmail.com & phone no. = 1234567890
3rd Transaction:
In the third transaction he has used his first email id i.e. john.smith@gmail.com and phone no. has been changed to 2234567890
Email Id= john.smith@gmail.com & phone no. = 2234567890
As we know from above 3 transaction John Smith is a single person doing all 3 purchases. Our requirement is to generate a unique ID (UID) for John Smith. In addition this will be continuous process for example lets say after a week, month or 6 month John Smith does a another transaction (4th Transaction)
4th Transaction:
In the fourth transaction he has changed the email to j.smith@gmail.com and kept the same phone no. as 1234567890
Email Id= j.smith@gmail.com & phone no. = 1234567890
In this case the 4th transaction should be tagged with the same unique (UID) generated on the basis of first 3 transaction because he is a same person but different email id this time.
There is also a possibility here that in Transaction 5 John Smith has purchased the item using altogether a new email id and phone no. as shown below. In this case system will generate a new unique UID as email and phone no. is different.
Transaction 5
Email Id= j.s@gmail.com and phone no. =3123456789
and in 6th transaction John has used a new phone no. 3123456789 but old email Id i.e. john.smith@gmail.com
Transaction 6
Email Id= john.smith@gmail.com and phone no. = 3123456789
Now in this case transaction 5
should not generate a new UID
as he is the same person "John Smith" we got to know from transaction 6
by his old email id.
Data Volume:
The data volume we have for such Customers are in 100's GB
.
Current Solution:
we are using Azure Databricks pyspark graph frame to solve this problem. Below is our solution approach.
Code Snippet:
from graphframes import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
customer_df = spark.table('default.customer_details').select(col('email').alias('id'),'phonenumber')
phone_df = customer_df.select(col('phonenumber').alias('id'),col('id').alias('phonenumber'))
final_df = customer_df.union(phone_df).dropDuplicates()
vertex = final_df.select('id')
edge = final_df.select(col('id').alias('src'),col('phonenumber').alias('dst'))
graph = GraphFrame(vertex, edge)
result = graph.stronglyConnectedComponents(maxIter=5)
newDf1 = result.withColumn("component",result.component.cast(StringType()))
hashdf1 = newDf1.withColumn("unique_id_gen", md5(newDf1.component)).dropDuplicates()
ResultComponent from the above snippet will become our unique UID.
Challenges with the current solution:
As we get daily incremental data on all the new purchases made by the customers, we need to process the entire dataset, i.e. History + incremental dataset to generate a unique UID for new customers and assign the previously generated unique UID for the existing customers that have made a new transactions. Pipeline is currently unpredictable every time it fails with new exceptions
We need your recommendation & suggestions to fix this problem statement, we are also open to new design / architecture changes if required.