1

I need to connect two tables in a query that I will use to insert data to third table (used in the future to join the two). I will mention only relevant columns in these tables.

PostgreSQL version 9.0.5

Table 1: data_table

migrated data, ca 10k rows, relevant columns:

id (primary key),

address (beginning of an address, string that I need to match with the second table. This address has varying length.)

Table 2: dictionary

dictionary, ca 9 mln rows, relevant columns:

id (primary key),

address (full address, string that I need to match with the first table, varying length as well.)

What exactly do I need

I need to correctly connect these tables in a select statement, and then insert these to a third table. All I need is a way to successfully connect these tables.

The way I want to do it is to take each address from data_table, and join it with first address (edit: order by address asc) from dictionary that begins with data_table.address (without multiplying records, as a lot of addresses in dictionary begin with each data_table.address).

Also, addressess in both tables contain a lot of irregular spaces, so we probably need to

replace(address, ' ', '') 

on both of them (any alternative ideas welcome). There might also be some performance issues since dictionary has 9 mln rows and the server is rather slow.

I see the result as some variation of following query:

select 
data_table.id, dictionary_id
from
data_table, dictionary
where
-conditions-
Jan Siekierski
  • 409
  • 6
  • 14
  • 1
    You lost me at "PostgreSQL version 9.0.5". You have major security holes and minor reliability/index corruption issues that are unpatched. Upgrade urgently. As for the rest of the question: **sample data** and actual table definitions please. I recommend http://sqlfiddle.com/ . +1 for making the effort to write a decent explanation and mentioning your Pg version. Comment here when you've edited your question with sample data and explained what you've already tried so I'm notified; I'll try to help out. – Craig Ringer May 15 '13 at 17:06
  • Preceding question for context: http://stackoverflow.com/questions/16565416/pgsql-joining-two-tables-on-complicated-condition Please define "first" in `join it with first address from dictionary`. There is no natural first without definition. Or are you ok with an arbitrary pick? Plus: cardinality of `data_table` (how many rows)? – Erwin Brandstetter May 15 '13 at 17:31
  • @CraigRinger: thanks for the advice. My task is a part of a huge project and it's out of my influence to demand upgrade. I'm back from work now and don't have access to the database, would you like to see a full script to recreate these tables? They have about 15 columns each, all of them irrelevant except for what I posted. Also, is there a way to post sample data in a readable format? ErwinBrandstetter: edited – Jan Siekierski May 15 '13 at 18:41
  • 2
    @fetta Please send these links to your management, then: http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-1899, http://www.postgresql.org/support/security/, http://www.postgresql.org/support/versioning/ . If they won't update to 9.0.13 then I'd be pretty worried about the project. – Craig Ringer May 16 '13 at 00:20

2 Answers2

2
SELECT DISTINCT ON (1)
       t.id, d.address, d.id
FROM   data_table t
JOIN   dictionary d ON replace(d.address, ' ', '')
                 LIKE (replace(t.address, ' ', '') || '%')
ORDER  BY t.id, d.address, d.id

(ORDER BY updated after question update.) Without ORDER BY it's picking an arbitrary match.
Explanation for the technique in this related answer:
Select first row in each GROUP BY group?

A functional index on your dictionary would make this fast:

CREATE INDEX dictionary_address_text_pattern_ops_idx
ON dictionary (replace(address, ' ', '') text_pattern_ops);

More explanation for that in the answer I provided to the precursing question.

One might debate if that gets you the "best" match. One alternative would be a similarity match with a trigram index. Details in the first of the links I added to your last question.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thank you, I'll try it tommorow at work and give you feedback, but from what I understand this is exactly what I needed. – Jan Siekierski May 15 '13 at 18:40
  • I've created the index and ran the query, but with `limit 5` clause it's running for 20 minutes now. I guess that for 10k rows it just won't execute at all. So, alternative ideas? I believe that something like `first_value(substr(d.address, 1, length(t.address)))over(partition by t.address)` would be faster, but you can't use analytic functions in where clause and I don't have a clear idea how to use it and/or how to join these tables. – Jan Siekierski May 16 '13 at 09:31
  • @fetta: Window functions are powerful but generally *not* faster than this simple solution with `LIKE`. The index is essential for a table with millions of rows. Does a simple `EXPLAIN` show that the index is going to be used? 20 minutes for 5 rows would be an indicator that your database is seriously broken or mis-configured in some way. – Erwin Brandstetter May 16 '13 at 13:15
0

The solution that our architect came up with was writing a function to find the first match.

The function:

CREATE OR REPLACE FUNCTION pick_one_address(text)
  RETURNS text AS
$BODY$
DECLARE
  address_query text;
  toFind text;
  found text;
BEGIN

  toFind := (replace($1, ' ', '') || '%');  
  address_query := 'select al.id from dictionary al where replace(al.adres, '' '', '''') like ''' || toFind ||''' limit 1'; 
  EXECUTE address_query into found;
  RETURN found;

RETURN found_address;
END $BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;

The code might seem strange since I did change table names to protect my company's privacy, and didn't mention third table I used to simplify the question, but I guess it should be enough to understand the mechanism.

Thanks for your input @ErwinBrandstetter, @CraigRinger

Jan Siekierski
  • 409
  • 6
  • 14