I decided to rework this question, since my previous approach was suboptimal.
I have two customer tables, t1 and t2 (27M and 7M rows, respectively). Each table has a last name, a first name, and a bunch of other personal data. I am doing identity resolution, so I want to first join the tables by matching names, and then look at the other columns to determine whether two records refer to the same person.
I want to account for the fact that the two tables might list different first names for the same person, e.g. when one of them uses a nickname.
Using an external dataset, I created an additional `nicknames` column in t2, which is an array: for example, if `fname2` is 'robert', then `nicknames` is `{robert,rob,bob}`. It always has at least one element, namely `fname2` itself.
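For concreteness, it looks roughly like this (a sketch, not my exact statements):

```sql
-- Sketch only: nicknames is a text[] column on t2, populated from an external
-- nickname dataset, and it always contains fname2 itself.
ALTER TABLE t2 ADD COLUMN nicknames text[];
-- e.g. a row with fname2 = 'robert' ends up with nicknames = '{robert,rob,bob}'
```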
The query to join them is:
```sql
SELECT *
FROM t1
JOIN t2 ON t1.lname1 = t2.lname2
       AND t1.fname1 = ANY (t2.nicknames);
```
However, I'm interested in maximizing performance by using appropriate indexes and partitioning for this task.
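These are the indexes I was considering, though I'm not sure they're the right ones (in particular, I believe a GIN index on the array can only help if the `ANY` condition is rewritten as array containment):

```sql
-- Plain btree indexes on the last-name columns for the equality part of the join.
CREATE INDEX ON t1 (lname1);
CREATE INDEX ON t2 (lname2);

-- GIN index on the nickname array; as far as I understand, the planner can only
-- use it if the condition is written as containment, e.g.
--   t2.nicknames @> ARRAY[t1.fname1]
-- rather than t1.fname1 = ANY (t2.nicknames).
CREATE INDEX ON t2 USING gin (nicknames);
```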
How can I take advantage of the fact that the datasets can be split into mutually exclusive blocks (alphabetically, by last name) that only ever need to be compared to one another?
The full query takes too long to run, but if I filter both t1 and t2 down to last names starting with "A", it finishes in just under 2 minutes.
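For reference, the per-letter variant I tried looks roughly like this (exact case handling depends on how the names are stored):

```sql
SELECT *
FROM t1
JOIN t2 ON t1.lname1 = t2.lname2
       AND t1.fname1 = ANY (t2.nicknames)
WHERE t1.lname1 LIKE 'A%'
  AND t2.lname2 LIKE 'A%';
```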
**What I want to do is capture this logic, almost like a for-loop over all the letters. How can I tell Postgres to split the tables alphabetically like that and only join the corresponding chunks to each other?
Or, simpler still, if that can't be done in this version, how would I write such a "for-loop" manually?**
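To make the second question concrete, something like the following is what I have in mind. It's only a sketch, assuming a pre-created results table `matches` whose columns line up with `SELECT t1.*, t2.*`, and lower-case last names:

```sql
DO $$
DECLARE
    letter text;
BEGIN
    -- Loop over the alphabet and join only the corresponding "blocks".
    FOREACH letter IN ARRAY string_to_array('abcdefghijklmnopqrstuvwxyz', NULL)
    LOOP
        INSERT INTO matches
        SELECT t1.*, t2.*
        FROM t1
        JOIN t2 ON t1.lname1 = t2.lname2
               AND t1.fname1 = ANY (t2.nicknames)
        WHERE left(t1.lname1, 1) = letter
          AND left(t2.lname2, 1) = letter;

        RAISE NOTICE 'finished letter %', letter;
    END LOOP;
END $$;
```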
I'd also be interested in knowing what this kind of problem is called, what keywords to search for, and where I can read up on it and on the recommended practices.
Unfortunately, I'm still on Postgres 9.4 and can't change that...