I am assuming you are the only one writing to these tables, so no concurrency conflicts.
Step 1
Add the ID column to table2
, can be NULL for now:
ALTER TABLE table2 ADD COLUMN citizenid int; -- or whatever the type is
Consider an additional flag on table1
to take row "off the market" cheaply on the other side, too:
ALTER TABLE table1 ADD COLUMN hasmatch boolean;
Step 2
Assuming there are no duplicates on either side. Else you need to do more.
Step 3
Update all rows in table2 with a perfect, complete match. Flag matching rows in table1
right away, too. With a data-modifying CTE:
WITH u2 AS (
UPDATE table2 t2
SET citizenid = t1.uniqid
FROM table1 t1
WHERE t1.hasmatch IS NULL -- always with this condition
AND t2.citizenid IS NULL -- always with this condition
AND t2.first_name = t1.first_name
AND t2.last_name = t1.last_name
AND t2.birthdate = t1.birthdate
AND t2.address = t1.address
RETURNING citizenid
)
UPDATE table1 t1
SET hasmatch = TRUE
FROM u2
WHERE t2.citizenid = u2.citizenid;
As soon as a row has its citizenid
it's "off the market" on both sides.
Step 4
Check how many rows are left and start softening predicates in baby steps to keep false positives at bay by always trying the more likely match first. Think of a systematic strategy before you start this cyclic process. Analyze remaining rows to find systematic typos or similar clues.
Possible options for fuzzy matching of character type columns are:
lower(t2.address) = lower(t1.address)
t2.address ILIKE (t1.address || %) -- here % is a wilcard for LIKE pattern
t1.address ILIKE (t2.address || %)
lower(left(t2.address, 20)) = lower(left(t1.address, 20))
t2.address % t1.address -- here % is the similarity operator
levenshtein(t2.address, t1.address) <= 3 -- *very* expensive, no index support
Etc.
The similarity operator %
is provided by the additional module pg_trgm, which also provides trigram indexes to support LIKE
and similarity matches. You will need indexes for fuzzy matching or your queries may take a long time.
Consider partial indexes to remove rows from the index as soon as a match is found. Like:
CREATE INDEX t1_adr_gin_trgm_idx ON table1 USING gin (address gin_trgm_ops)
WHERE t1.hasmatch IS NULL;
CREATE INDEX t2_adr_gin_trgm_idx ON table2 USING gin (address gin_trgm_ops)
WHERE t2.citizenid IS NULL;
Etc.
You can fine tune the similarity threshold for the %
operator with:
SELECT set_limit(0.8);
Step 5
A small percentage is going to remain unresolved. You can spend an increasing amount of time to assign them manually until you decide to discard the rest.
Step 6
Optional. When the process is finished, every row has a citizenid
, which can now be set to NOT NULL
, new rows must have a citizenid
More Details: