
I have two tables, already populated with data, as follows:

table1 data are citizens data, with fields:

uniqid  (a unique ID per person, like a social security number)
first_name 
last_name
birthdate
address

table2 with fields:

first_name 
last_name
birthdate
address
gender
healthinsurance_type
...

table1 data and table2 data come from two separate agencies, so names in those tables could be typed differently, and so on.

table1 is authoritative for names and ID. table2 is what I need to work on, but there's no ID (citizenID).

Now I need each row in table2 to get a citizenid, associated from table1, so that in the end I get table2 with additional ID column, correctly associated with each person.

Something like search in table1 for a person (a row in table2) where some conditions match, and if a record exists in table1, put the ID of that record to a column in table2.

But do it fast and for all rows.
row count of table1 is around 2 million.
row count of table2 is around 900,000

rifaiz
  • A computer could screw this up really efficiently. basically you can't join those tables. maybe you could create a third table that links the two together or add columns from one table to the other. – Jasen Jul 01 '15 at 06:04
  • You forgot to provide data types and constraints. Provide (the relevant parts of) what you get from psql with `\d tbl`. And, as *always*, your version of Postgres. Plus, can there be duplicates on either side? – Erwin Brandstetter Jul 01 '15 at 06:41
  • version 9.1. Table1: uniqid: integer (11 digits), pk; names: char, not null; birthdate: date, not null; address: char, not null. Table2: names, address, gender, insutype: char, not null; birthdate: date, not null. table1 data are from the civil registry office; this is the reference. No duplicates. table2 data are a list of proposed healthcare recipients, collected manually by officers in the field (minus the citizen ID, unfortunately), and then should be submitted to the ministry WITH a citizen ID on each individual. So I need to put the correct citizen ID on each person in table2, using table1 as a reference – rifaiz Jul 01 '15 at 07:25
  • Please ***edit*** the question. That kind of information is hardly readable in comments. – Erwin Brandstetter Jul 01 '15 at 07:37

1 Answer


I am assuming you are the only one writing to these tables, so no concurrency conflicts.

Step 1

Add the ID column to table2, can be NULL for now:

ALTER TABLE table2 ADD COLUMN citizenid int; -- or whatever the type is

Consider an additional flag on table1 to take rows "off the market" cheaply on that side, too:

ALTER TABLE table1 ADD COLUMN hasmatch boolean;

Step 2

Assuming there are no duplicates on either side; otherwise you need to do more first.
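To verify that assumption before starting, a quick check (my sketch, matching on the same four columns used below) counts duplicate person rows on each side:

```sql
-- Duplicate check on table1 (repeat analogously for table2)
SELECT first_name, last_name, birthdate, address, count(*) AS dupes
FROM   table1
GROUP  BY first_name, last_name, birthdate, address
HAVING count(*) > 1;
```

If this returns any rows, those people cannot be matched unambiguously and need special treatment.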

Step 3

Update all rows in table2 with a perfect, complete match. Flag matching rows in table1 right away, too. With a data-modifying CTE:

WITH u2 AS (
   UPDATE table2 t2
   SET    citizenid = t1.uniqid  
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL   -- always with this condition
   AND    t2.citizenid IS NULL  -- always with this condition
   AND    t2.first_name = t1.first_name
   AND    t2.last_name  = t1.last_name
   AND    t2.birthdate  = t1.birthdate
   AND    t2.address    = t1.address
   RETURNING citizenid
   )
UPDATE table1 t1
SET    hasmatch = TRUE
FROM   u2
WHERE  t1.uniqid = u2.citizenid;

As soon as a row has its citizenid it's "off the market" on both sides.

Step 4

Check how many rows are left and start softening predicates in baby steps to keep false positives at bay by always trying the more likely match first. Think of a systematic strategy before you start this cyclic process. Analyze remaining rows to find systematic typos or similar clues.
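To gauge progress after each pass, a simple count shows how many rows in table2 are still unmatched:

```sql
SELECT count(*) AS unmatched
FROM   table2
WHERE  citizenid IS NULL;
```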

Possible options for fuzzy matching of character type columns are:

lower(t2.address) = lower(t1.address)
t2.address ILIKE (t1.address || '%') -- here % is a wildcard for the LIKE pattern
t1.address ILIKE (t2.address || '%')
lower(left(t2.address, 20)) = lower(left(t1.address, 20))
t2.address % t1.address  -- here % is the similarity operator
levenshtein(t2.address, t1.address) <= 3  -- *very* expensive, no index support

Etc. The similarity operator % is provided by the additional module pg_trgm, which also provides trigram indexes to support LIKE and similarity matches. You will need indexes for fuzzy matching or your queries may take a long time.
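As an illustration (my sketch, not part of the original answer), a softened pass reuses the exact CTE pattern from Step 3 with one predicate loosened — here a case-insensitive name match, everything else still exact. pg_trgm must be installed once per database before the `%` operator or trigram indexes are available:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- once per database

-- Softened pass: case-insensitive names, other predicates exact
WITH u2 AS (
   UPDATE table2 t2
   SET    citizenid = t1.uniqid
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL   -- always with this condition
   AND    t2.citizenid IS NULL  -- always with this condition
   AND    lower(t2.first_name) = lower(t1.first_name)
   AND    lower(t2.last_name)  = lower(t1.last_name)
   AND    t2.birthdate  = t1.birthdate
   AND    t2.address    = t1.address
   RETURNING citizenid
   )
UPDATE table1 t1
SET    hasmatch = TRUE
FROM   u2
WHERE  t1.uniqid = u2.citizenid;
```

Each subsequent pass loosens one predicate at a time, so earlier (stricter) matches always win.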

Consider partial indexes to remove rows from the index as soon as a match is found. Like:

CREATE INDEX t1_adr_gin_trgm_idx ON table1 USING gin (address gin_trgm_ops)
WHERE hasmatch IS NULL;

CREATE INDEX t2_adr_gin_trgm_idx ON table2 USING gin (address gin_trgm_ops)
WHERE citizenid IS NULL;

Etc.

You can fine tune the similarity threshold for the % operator with:

SELECT set_limit(0.8);
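To see what a given threshold catches before committing to an UPDATE, a candidate-inspection query (my sketch, not part of the original answer) can list the remaining pairs ordered by similarity:

```sql
SELECT set_limit(0.8);  -- pairs below similarity 0.8 are filtered out by %

SELECT t2.first_name, t2.last_name, t1.uniqid
     , similarity(t2.address, t1.address) AS sim
FROM   table2 t2
JOIN   table1 t1 ON t2.address % t1.address
                AND t2.birthdate = t1.birthdate
WHERE  t2.citizenid IS NULL
AND    t1.hasmatch  IS NULL
ORDER  BY sim DESC;
```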

Step 5

A small percentage is going to remain unresolved. You can spend an increasing amount of time to assign them manually until you decide to discard the rest.

Step 6

Optional. When the process is finished and every row has a citizenid, the column can be set to NOT NULL, so new rows must have a citizenid.
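Setting the constraint once all rows are resolved:

```sql
ALTER TABLE table2 ALTER COLUMN citizenid SET NOT NULL;
```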


Erwin Brandstetter
  • Thanks a lot, Erwin, Still relatively novice on CTE though. I can see the direction. – rifaiz Jul 02 '15 at 04:17
  • @rifaiz: [Search for \[postgres\] data-modifying CTE](http://stackoverflow.com/search?q=[postgres]+data-modifying+CTE), you'll find lots of examples and explanation. I also added a link to the manual above. – Erwin Brandstetter Jul 02 '15 at 06:10