Criss Cross Address Deduplication

Question

I have a db table persons in which only persons details are captured. Say Name, Father Name, Email, DOB, Proof of Address, Proof of Identity ,Pincode etc. and I have an address table in which address of the persons are stored say Address, Pincode, City, Province. Persons and Address have a foreign key relationship with column addressId(persons) and Id(address).

NOTE: Using Postgres

Now what I have to do say I am given with some proof of address id and I have to find how much such persons with same POA or POI are there and with different id if somewhere Name, Father Name and Address (all three matches).

Contacts Schema

Address Schema

ID	Address	Pincode	City	Province
ad_1	mini market 850 vijay street	110056	XYZ	ABC
ad_2	45 street south coast near Bank	285006	MNO	DEF
ad_3	45 street south coast near Bank	285006	MNO	DEF

E.g Input POA Id = poi_1

considering input poa id, we can do direct string matching query on DB.

select * from persons where poi_id = 'poi_1' or poa_id = 'poi_1';

The result is person records with Id 1,2. Now if you observe person 2 has same poi/poa id but address is different. Now we have to consider this address and do dedupe based on this address based on which person with id 3 comes in response. (Name,Fathername and address all 3 must match)

I need an optimised way to resolve the same. If I do like query on name and father name only, It will do a full db scan which is not feasible at all. Same goes for address like query.

My thought:

select * from addressTable where address like = %input address% and pincode = '285006';

I haven't thought how should i proceed bcz everytime I stuck on what basis I should do a query so that it is optimised.

**Note:**

indexes mentioned below 
"person_new_dob_idx" btree (dob) 
"person_new_email_idx" btree (email) 
"person_fst_name_idx" btree (name) 
"person_poa_idx" btree (poa_id) 
"person_new_poi_idx" btree (poi_id) 
"person_addr_id_idx" btree (addr_id) 
"person_dob_primary_pincode_idx" btree (dob, primary_pincode) 

Postgres version:-- 13.4

Points to consider:

In the mentioned example pincode is different, but there might be possibility of different address within same pincode.
Do care for false positives and true negatives.
Address words might be shuffled or written same address in different way.

I suggest you read these: [Provide a `Minimal Complete Verifiable Example` (MCVE)](https://stackoverflow.com/help/mcve) and [Why should I provide a MCVE](https://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query) — Paul Maxwell, Aug 13 '23 at 04:57
Long text descriptions are a very poor substitute for actual table DDL, including indexes. An example of a working query (even if inefficient) is more helpful than just a text description. Begin with a working query and then look for possible optimizations. The explain plan for a working query should foster valuable insite into any performance issues. — JohnH, Aug 13 '23 at 05:57
@PaulMaxwell Modified question and explained with an example. — Akhilesh mahajan, Aug 13 '23 at 06:49
Although a schema can be inferred from the sample data, it is still not as useful as actual table DDL, especially since there are no index details. It would also be useful to know which release of PostgreSQL is to be used. — JohnH, Aug 13 '23 at 08:08
@JohnH indexes mentioned below "person_new_dob_idx" btree (dob) "person_new_email_idx" btree (email) "person_fst_name_idx" btree (name) "person_poa_idx" btree (poa_id) "person_new_poi_idx" btree (poi_id) "person_addr_id_idx" btree (addr_id) "person_dob_primary_pincode_idx" btree (dob, primary_pincode) INVALID Access method: heap Postgres version: server_version ---------------- 13.4 — Akhilesh mahajan, Aug 13 '23 at 09:08
Thanks for added details, it has resulted in an effective answer already. However in future don't include DDL or code in comments. Instead such detail should be included in the question. — Paul Maxwell, Aug 13 '23 at 12:32

score 2 · Answer 1 · answered Aug 13 '23 at 09:51

I'll just start by leaving these here for reference in case, my answer didn't hit home.

-- Create Address Table
CREATE TABLE address (
    Id SERIAL PRIMARY KEY,
    Address TEXT NOT NULL,
    Pincode TEXT NOT NULL,
    City TEXT NOT NULL,
    Province TEXT NOT NULL
);

-- Create Person Table
CREATE TABLE persons (
    Id SERIAL PRIMARY KEY,
    Name TEXT NOT NULL,
    Father_Name TEXT NOT NULL,
    Email TEXT NOT NULL,
    DOB DATE NOT NULL,
    POA_Id TEXT NOT NULL,
    POI_Id TEXT NOT NULL,
    Pincode TEXT NOT NULL,
    AddressId INTEGER REFERENCES address(Id)
);

-- Insert statements for data provided
INSERT INTO address (Address, Pincode, City, Province)
VALUES
    ('mini market 850 vijay street', '110056', 'XYZ', 'ABC'),
    ('45 street south coast near Bank', '285006', 'MNO', 'DEF'),
    ('45 street south coast near Bank', '285006', 'MNO', 'DEF');

INSERT INTO persons (Name, Father_Name, Email, DOB, POA_Id, POI_Id, Pincode, AddressId)
VALUES
    ('PN1', 'FN1', 'pn1.fn1@d1.com', '2000-01-01', 'poi_1', 'poa_2', '110056', 1),
    ('PN1', 'FN1', 'pn1.fn1@d1.com', '2000-01-01', 'poi_1', 'poi_1', '285006', 2),
    ('PN1', 'FN1', 'pn1.fn1@d1.com', '2000-01-01', 'poi_3', 'poa_3', '285006', 3);

As far as I understand, you want to query your database (in an efficient manner) to find persons with the same proof of address or proof of identity, and then deduplicate the results based on the following triplet (name, father's name, and address).

Based on that I propose the following query

-- Import fuzzy matching to add some flexibility to matching addresses
-- (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html)
CREATE EXTENSION fuzzystrmatch;

-- Proposed query
SELECT p1.*
FROM persons p1
JOIN address a1 ON p1.addressId = a1.Id
WHERE (p1.poi_id = 'poi_1' OR p1.poa_id = 'poi_1')
AND EXISTS (
    SELECT 1
    FROM persons p2
    JOIN address a2 ON p2.addressId = a2.Id
    AND p1.Id != p2.Id
    AND levenshtein(p1.name, p2.name) <= 3
    AND levenshtein(p1.father_name, p2.father_name) <= 3
    AND levenshtein(a1.address, a2.address) <= 10
);

Join the persons and address tables on the addressId and Id columns will allow you to access both the person and address information in a single query (straightforward). For address ordering, you can string similarity function such as levenshtein() (what I used here) or pg_trgm() to compare the name, father's name, and address fields. This will allow you to find records that are similar but not necessarily identical.

Results:

You can (and probably should) adjust the similarity thresholds and test different string similarity functions to fine-tune the results. You may also want to consider using an index on the name, father_name, and address columns to speed up the query.

Does this solution follow what I asked in the question? If some different address appears in the poi/poa query we have to do a query on the same and it goes recursively until no new address comes up. — Akhilesh mahajan, Aug 19 '23 at 06:23

Criss Cross Address Deduplication

1 Answers1