
I'm doing identity resolution on two huge customer tables. I want to join them on the customer's last name, and on first name up to nickname variations, for which I have already prepared regexes:

SELECT * FROM t1
INNER JOIN t2
ON t1.lname = t2.lname
AND t1.fname ~ t2.fname_regex

The tables are very large, so I figured I would split them up alphabetically. But even after narrowing t1 and t2 down to last names starting with the letter "a", I have 671k rows in t1 and 203k in t2, and the above query fails to complete in many hours.

Basically, what I would like to do is partition the tables into chunks of, say, 67k / 20k rows (a query limited to that size runs in reasonable time) and join chunk to chunk. This is possible because we're matching names alphabetically.

For example, the t1 chunk with last names in, say, (Aaa-Abg) only needs to be checked against the t2 chunk with the same alphabetical name range.
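
Something like the following is what I have in mind for a single chunk pair; the boundary values 'aaa' and 'abh' are placeholders for precomputed chunk boundaries, and I'm assuming plain b-tree indexes on t1.lname and t2.lname so that each query only touches its own slice:

SELECT * FROM t1
INNER JOIN t2
ON t1.lname = t2.lname
AND t1.fname ~ t2.fname_regex
WHERE t1.lname >= 'aaa' AND t1.lname < 'abh'
  AND t2.lname >= 'aaa' AND t2.lname < 'abh'

Running one such query per chunk from a driver script and accumulating the results would be the manual version of what I'm asking for.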

It seems to me like this might be a pretty common scenario, so I'm wondering if there are any best practices regarding such a partition/join.

Is there any way to indicate to Postgres that I want to partition my tables into such alphabetical chunks on some column, so that it can prepare the partitions and indexes in an automated way and perform the join chunk by chunk?

P.S. I'm on Postgres 9.4.26 and can't do anything about it

EDIT:

Since the problem seems to be mainly with the regex, let me explain what I'm trying to achieve here. I want to match by first name, but up to nicknames, so that "Robert" matches "Bob" etc. So for each fname I built an fname_regex string which looks like "(^robert$|^bob$|^rob$)", using an external nicknames dataset. I didn't expect it to axe performance this badly. If this approach can't be salvaged, then what would be the recommended way of implementing such logic?
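
One alternative I can see, sketched under the assumption that each variant maps to a single canonical name (the nickname_map table and the fname_canon column are hypothetical names):

-- lookup built from the external nicknames dataset,
-- e.g. 'bob' -> 'robert', 'rob' -> 'robert'
CREATE TABLE nickname_map (
    variant   text PRIMARY KEY,
    canonical text NOT NULL
);

-- precompute a canonical first name on both tables
ALTER TABLE t1 ADD COLUMN fname_canon text;
UPDATE t1 SET fname_canon = coalesce(
    (SELECT m.canonical FROM nickname_map m WHERE m.variant = lower(t1.fname)),
    lower(t1.fname));
-- ... same for t2 ...

CREATE INDEX ON t1 (lname, fname_canon);
CREATE INDEX ON t2 (lname, fname_canon);

-- the join becomes pure equality, which b-tree indexes support
SELECT * FROM t1
INNER JOIN t2
ON t1.lname = t2.lname
AND t1.fname_canon = t2.fname_canon;

If a variant can map to several canonical names ('chris' -> 'christopher'/'christine'), joining through nickname_map on both sides instead of precomputing a column would work the same way.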

EDIT 2:

I made a new question here: Joining two tables by column with several possible values. Improving performance

Spine Feast
  • 1M rows is not a lot of data, much less huge. Regular expressions can't use indexes, so that query can't use any indexes and *must* compare all rows that match by name. Partitioning won't fix that problem; it will make it worse, because the server may have to search multiple files instead of just one – Panagiotis Kanavos Feb 23 '23 at 11:42
  • Note that Postgres 9.4 is [no longer supported](https://www.postgresql.org/support/versioning/); you should plan an upgrade as soon as possible. – Feb 23 '23 at 11:43
  • @PanagiotisKanavos: if the table was partitioned by last name, I do think a modern Postgres version could use partition-wise joins combined with parallel query to speed things up – Feb 23 '23 at 11:44
  • @PanagiotisKanavos well that's just the letter A. I have 27M and 7M in the full tables, respectively. But even for just the letter A, I cannot get the join to complete. – Spine Feast Feb 23 '23 at 11:45
  • @SpineFeast because `t1.fname ~ t2.fname_regex` can't use indexes. This isn't a common implementation. Database performance comes from indexing, and indexes are good at equality or range checks. `LIKE 'banana%'` can be converted to a range check and therefore can use an index. `LIKE '%anana%'` though can't, and would have to search the entire table to find matches. – Panagiotis Kanavos Feb 23 '23 at 11:48
  • @a_horse_with_no_name the partition is by last name, the regex by first name, so it would have to scan all partitions. – Panagiotis Kanavos Feb 23 '23 at 11:49
  • @PanagiotisKanavos what if I remove the regex from the join condition? essentially replacing the AND with a WHERE – Spine Feast Feb 23 '23 at 11:51
  • @SpineFeast are you looking for full text searching perhaps? That's a common scenario and works by shredding text fields and composing specialized indexes from words *and* their variations. PostgreSQL [supports full text searching](https://www.postgresql.org/docs/current/textsearch.html). There are [specialized index types](https://www.postgresql.org/docs/current/textsearch-indexes.html) for full text searches. – Panagiotis Kanavos Feb 23 '23 at 11:51
  • @PanagiotisKanavos I updated the OP with extra info regarding the regex – Spine Feast Feb 23 '23 at 11:56
  • `essentially replacing the AND with a WHERE` the problem is the condition, not where it's written. What's the purpose of this query? Is this an attempt at fuzzy matching, eg dirty names to cleaned-up names? – Panagiotis Kanavos Feb 23 '23 at 11:56
  • @PanagiotisKanavos it's identity resolution. I'm joining on last name (exact match) and then first name up to nickname, as described above with the regex. Then I will be looking at other columns to determine if they are indeed the same person – Spine Feast Feb 23 '23 at 12:17
  • Best practice would be to upgrade to a recent version, since 9.4 has been EOL for years. Performance would benefit as well; we got 10x the performance from moving from version 10 to 14. – Frank Heikens Feb 23 '23 at 12:22
  • @FrankHeikens not up to me – Spine Feast Feb 23 '23 at 12:26
  • In the only example you give, the alternations are anchored at both front and end. So they should be equivalent to an `=ANY()`, and doing it that way may be more performant (see the sketch after the comments). – jjanes Feb 23 '23 at 13:45
  • 2
    @PanagiotisKanavos regular expression *can* be used with trigram indexes. – jjanes Feb 23 '23 at 13:47
  • 1
    @PanagiotisKanavos But the two conditions are ANDed together. The regexp only needs to be evaluated on those partitions which survive the last name conditions, which might be a huge win. – jjanes Feb 23 '23 at 13:58
  • Parallelizing a brute force match is OK if nothing else uses the server. You get 10 cores blocked for an hour instead of 1 core blocked for 10 hours. I didn't know about trigram indexes, and that's exactly the kind of index you need for similarity matches. – Panagiotis Kanavos Feb 23 '23 at 14:03
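
A sketch of the two rewrites suggested in the comments. First, jjanes's `=ANY()` form, assuming the variants are stored as an array column (fname_variants is a hypothetical name) in place of the regex:

SELECT * FROM t1
INNER JOIN t2
ON t1.lname = t2.lname
AND t1.fname = ANY (t2.fname_variants)

Second, keeping the regex, a trigram index can support it (pg_trgm has supported regex matching through GIN trigram indexes since PostgreSQL 9.3, so this works on 9.4):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON t1 USING gin (fname gin_trgm_ops);

Whether the planner actually uses the trigram index inside this join depends on the plan it chooses; it helps most when the pattern side (t2) ends up as the outer side of a nested loop.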

1 Answer


If the regular expression is your bottleneck, all you can hope for is parallel query (available in supported versions of PostgreSQL, but not in 9.4), which allows the query to run on several cores in parallel.
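
For illustration, a minimal way to try this after an upgrade (parallel query appeared in 9.6, so none of this has any effect on 9.4); max_parallel_workers_per_gather is the relevant setting, and 4 is an arbitrary value:

-- requires PostgreSQL 9.6 or later
SET max_parallel_workers_per_gather = 4;
EXPLAIN
SELECT * FROM t1
INNER JOIN t2
ON t1.lname = t2.lname
AND t1.fname ~ t2.fname_regex;
-- look for "Gather" and "Parallel Seq Scan" nodes in the plan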

Laurenz Albe