PGSQL - Joining two tables on complicated condition

Question

I got stuck during database migration on PostgreSQL and need your help.

I have two tables that I need to join: drzewa_mateczne.migracja (data I need to migrate) and ibl_as.t_adres_lesny (dictionary I need to join with migracja).

I need to join them on replace(drzewa_mateczne.migracja.adresy_lesne, ' ', '') = replace(ibl_as.t_adres_lesny.adres, ' ', ''). However my data is not very regular, so I want to join it on first good match with the dictionary.

I've created the following query:

select
count(*)

from
drzewa_mateczne.migracja a

where
length(a.adresy_lesne) > 0
and replace(a.adresy_lesne, ' ', '') = (select substr(replace(al.adres, ' ', ''), 1, length(replace(a.adresy_lesne, ' ', ''))) from ibl_as.t_adres_lesny al limit 1)

The query doesn't return any rows. It does successfully join empty rows if ran without

length(a.adresy_lesne) > 0

The two following queries return rows (as expected):

select replace(adres, ' ', '') 
from ibl_as.t_adres_lesny
where substr(replace(adres, ' ', ''), 1, 16) = '16-15-1-13-180-c'
limit 1


select replace(adresy_lesne, ' ', ''), length(replace(adresy_lesne, ' ', '')) 
from drzewa_mateczne.migracja
where replace(adresy_lesne, ' ', '') = '16-15-1-13-180-c'

I'm suspecting that there might be a problem in sub-query inside the 'where' clause in my query. If you guys could help me resolve this issue, or at least point me in the right direction, I'd be very greatful.

Thanks in advance, Jan

Follow-up question here: http://stackoverflow.com/questions/16570597/joining-two-tables-in-a-complex-query-not-uniform-data — Erwin Brandstetter, May 15 '13 at 17:44

score 1 · Answer 1 · edited May 23 '17 at 11:55

1

You can largely simplify to:

SELECT count(*)
FROM   drzewa_mateczne.migracja a
WHERE  a.adresy_lesne <> ''
AND    EXISTS (
   SELECT 1 FROM ibl_as.t_adres_lesny al 
   WHERE  replace(al.adres, ' ', '')
    LIKE (replace(a.adresy_lesne, ' ', '') || '%')
   )

a.adresy_lesne <> '' does the same as length(a.adresy_lesne) > 0, just faster.
Replace the correlated subquery with an EXISTS semi-join (to get only one match per row).
Replace the complex string construction with a simple LIKE expression.

More information on pattern matching and index support in these related answers:
PostgreSQL LIKE query performance variations
Difference between LIKE and ~ in Postgres
speeding up wildcard text lookups

edited May 23 '17 at 11:55

Community

1
1

answered May 15 '13 at 13:53

Erwin Brandstetter

605,456
145
1,078
1,228

Looks neat and I like it a lot, but ibl_as.t_adres_lesny has 9 mln rows. I've ran the query, but I'm afraid that it won't finish until I leave work. Anyway, as I've discovered thanks to Wojciech Brudniak, I need to select data from both tables, not just filter the first one. Any suggestions on how to do that? – Jan Siekierski May 15 '13 at 14:13
And yeah, I've tried to avoid the 'LIKE' statement because of it's speed. I'm working on 9 mln rows on a slow server, query optimisation is very important here. – Jan Siekierski May 15 '13 at 14:21
@fetta: The `LIKE` variant should be considerably *faster*. In addition, it can be backed up with a functional `text_pattern_ops` index on `replace(al.adres, ' ', '')` to make it *really* fast. Be sure to read about the details in the related answers I am linking to. As to your changed requirements: Write a new question with the exact details. Be sure to include your Postgres version, the relevant table definitions and a sample of your desired result. – Erwin Brandstetter May 15 '13 at 15:02
It isn't faster. It's been running for 80 minutes, I had to kill it. I'll follow your advice and write a new question in a moment. – Jan Siekierski May 15 '13 at 15:26
http://stackoverflow.com/questions/16570597/joining-two-tables-in-a-complex-query-not-uniform-data – Jan Siekierski May 15 '13 at 16:40

score 0 · Answer 2 · answered May 15 '13 at 13:05

0

What you're basically telling the database to do is to get you the count of rows from drzewa_mateczne.migracja that have a non-empty adresy_lesne field that is a prefix of the adres field of a semi-random ibl_as.t_adres_lesny row...

Lose the "limit 1" in the subquery and substitute the "=" with "in" and see if that is what you wanted...

answered May 15 '13 at 13:05

Wojciech Budniak

102
5

I want the first match, not all of the matches. If I do what you suggested, I will get multiple matches in ibl_as.t_adres_lesny for each row in drzewa_mateczne.migracja. I need exactly 1 match when it can be found, and none if it can't. – Jan Siekierski May 15 '13 at 13:26
You're right, I haven't realized that. Even if it works, I still have to add the ibl_as.t_adres_lesny table into the 'from' clause in order to use it in select. However the count(*) still returns 0 rows, if the current 'where' clause worked as intended I'd just get a count of rows that can be matched, right? – Jan Siekierski May 15 '13 at 13:47
You would get the count of drzewa_mateczne.migracja rows that have a match. – Wojciech Budniak May 15 '13 at 14:07

PGSQL - Joining two tables on complicated condition

2 Answers2

Linked