I'm doing identity resolution on two huge customer tables. I want to join them on the customer's last name, and on first name up to nickname variations, for which I have already prepared regexes:
SELECT *
FROM t1
INNER JOIN t2
   ON t1.lname = t2.lname
  AND t1.fname ~ t2.fname_regex;
The tables are very large, so I figured I would split them up alphabetically. But even after narrowing t1 and t2 down to last names starting with the letter "a", I have 671k rows in t1 and 203k in t2, and the above query doesn't complete even after many hours.
Basically, what I would like to do is partition the tables into chunks of, say, 67k rows in t1 and 20k in t2 (a query limited to that size runs in reasonable time) and join chunk to chunk. This is possible because we're working through the names alphabetically.
For example, the t1 chunk with last names in, say, (Aaa-Abg) only needs to be checked against the t2 chunk with the same alphabetical name range.
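To illustrate, one chunk-to-chunk join would look something like this (the range bounds are just examples, and assume lname is stored lower-case):

SELECT *
FROM t1
INNER JOIN t2
   ON t1.lname = t2.lname
  AND t1.fname ~ t2.fname_regex
-- restrict both sides to the same alphabetical chunk
WHERE t1.lname >= 'aaa' AND t1.lname < 'abh'
  AND t2.lname >= 'aaa' AND t2.lname < 'abh';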
It seems to me that this might be a fairly common scenario, so I'm wondering whether there are any best practices for this kind of partition-and-join.
Is there any way to tell Postgres that I want to partition my tables into such alphabetical chunks on some column, so that it can prepare the partitions and indices in an automated way and perform the join chunk by chunk?
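By hand, I imagine one such chunk would be an inheritance-style child table along these lines (the child table name and range bounds are placeholders), but I'd rather not create and maintain dozens of these manually:

-- one manual chunk of t1, restricted to an alphabetical range of last names
CREATE TABLE t1_aaa_abg (
    CHECK (lname >= 'aaa' AND lname < 'abh')
) INHERITS (t1);

-- index the chunk so lookups within the range stay fast
CREATE INDEX ON t1_aaa_abg (lname);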
P.S. I'm on Postgres 9.4.26 and can't do anything about that.
EDIT:
Since the problem seems to be mainly the regex match, let me explain what I'm trying to achieve here. I want to match on first name, but up to nicknames, so that "Robert" matches "Bob", etc. So for each fname I built an fname_regex string that looks like "(^robert$|^bob$|^rob$)", using an external nicknames dataset. I didn't expect this to hurt performance so badly. If this approach can't be salvaged, what would be the recommended way of implementing such logic?
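For reference, the regexes are assembled from the nickname data roughly like this (the nicknames table and its column names here are placeholders, not the real ones):

-- nicknames: one row per (canonical, variant) pair from the external dataset,
-- with each canonical name also listed as a variant of itself
UPDATE t2
SET fname_regex = (
    SELECT '(' || string_agg(DISTINCT '^' || n2.variant || '$', '|') || ')'
    FROM nicknames n1
    JOIN nicknames n2 ON n2.canonical = n1.canonical
    WHERE n1.variant = lower(t2.fname)
);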
EDIT 2:
I asked a new question here: Joining two tables by column with several possible values. Improving performance