I decided to rework this question, since my previous approach was suboptimal.
I have two customer tables, t1 and t2 (27M and 7M rows, respectively). Each table has a last name, a first name, and a bunch of other personal data. I am doing identity resolution, so I want to first join the tables by matching names, and then look at the other columns to determine whether two records refer to the same person.
I want to account for the fact that the two tables might list different first names for the same person, e.g. when one of them uses a nickname.
Using an external dataset, I created an additional `nicknames` column in t2, which is an array: for example, if `fname2` is 'robert', then `nicknames` is `{robert,rob,bob}`. It always has at least one element, namely `fname2` itself.
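For concreteness, it looks roughly like this (a sketch, not my exact statements):

```sql
-- Sketch only: nicknames is a text[] column on t2, populated from an external
-- nickname dataset, and it always contains fname2 itself.
ALTER TABLE t2 ADD COLUMN nicknames text[];
-- e.g. a row with fname2 = 'robert' ends up with nicknames = '{robert,rob,bob}'
```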
The query to join them is:
```sql
SELECT *
FROM t1
JOIN t2 ON t1.lname1 = t2.lname2
       AND t1.fname1 = ANY (t2.nicknames);
```
However, I'm interested in maximizing performance by using appropriate indexes and partitioning for this task.
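These are the indexes I was considering, though I'm not sure they're the right ones (in particular, I believe a GIN index on the array can only help if the `ANY` condition is rewritten as array containment):

```sql
-- Plain btree indexes on the last-name columns for the equality part of the join.
CREATE INDEX ON t1 (lname1);
CREATE INDEX ON t2 (lname2);

-- GIN index on the nickname array; as far as I understand, the planner can only
-- use it if the condition is written as containment, e.g.
--   t2.nicknames @> ARRAY[t1.fname1]
-- rather than t1.fname1 = ANY (t2.nicknames).
CREATE INDEX ON t2 USING gin (nicknames);
```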
How can I take advantage of the fact that the datasets can be split into mutually exclusive blocks (alphabetically, by last name) that only ever need to be compared to one another?
The full query takes too long to run, but if I filter both t1 and t2 down to last names starting with "A", it finishes in just under 2 minutes.
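For reference, the per-letter variant I tried looks roughly like this (exact case handling depends on how the names are stored):

```sql
SELECT *
FROM t1
JOIN t2 ON t1.lname1 = t2.lname2
       AND t1.fname1 = ANY (t2.nicknames)
WHERE t1.lname1 LIKE 'A%'
  AND t2.lname2 LIKE 'A%';
```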
**What I want to do is capture this logic, almost like a for-loop over all the letters. How can I tell Postgres to split the tables alphabetically like that and only join the corresponding chunks to each other?
Or, simpler still, if that can't be done in this version, how would I write such a "for-loop" manually?**
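To make the second question concrete, something like the following is what I have in mind. It's only a sketch, assuming a pre-created results table `matches` whose columns line up with `SELECT t1.*, t2.*`, and lower-case last names:

```sql
DO $$
DECLARE
    letter text;
BEGIN
    -- Loop over the alphabet and join only the corresponding "blocks".
    FOREACH letter IN ARRAY string_to_array('abcdefghijklmnopqrstuvwxyz', NULL)
    LOOP
        INSERT INTO matches
        SELECT t1.*, t2.*
        FROM t1
        JOIN t2 ON t1.lname1 = t2.lname2
               AND t1.fname1 = ANY (t2.nicknames)
        WHERE left(t1.lname1, 1) = letter
          AND left(t2.lname2, 1) = letter;

        RAISE NOTICE 'finished letter %', letter;
    END LOOP;
END $$;
```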
I'd also be interested in knowing what this kind of problem is called, what keywords to search for, and where I can read up on it and on the recommended practices.
Unfortunately, I'm still on Postgres 9.4 and can't change that...