0

I'm using DENSE_RANK() to produce PRIMARY KEY candidate columns from other TEXT columns.

E.g.

SELECT DENSE_RANK() OVER(ORDER BY title) AS pk_id;

I'm aware than sorting text is slower than sorting integers, so I had the idea to convert the text to an integer format before ranking. (following advice from: Hashing a String to a Numeric Value in PostgreSQL)

My method currently uses a temporary table, as I'm trying to ensure that md5(...) only runs once per row, and I can't be sure of that if I simply put the md5(...) call within the OVER(...) part. What I do instead is: 1) create a temporary table using md5(...), then 2) I run DENSE_RANK() over this other table instead, thus creating my primary key column.

But I thought I'd ask here, to know for sure: Would it actually be faster to use md5(...) within the OVER(...) parentheses? I.e.

SELECT DENSE_RANK() OVER(ORDER BY 'x'||substr(md5($1),1,16))::bit(64)::bigint) AS pk_id; `

My question can also be worded as follows: Would the Postgres query optimiser be good enough to recognise that the "to bigint" conversion is a static procedure? In particular, I'm worried that the md5(...) conversion would run once per comparison (O(n log n)), rather than once per row (O(n).

PS: One or two hash collisions are okay for my use case.

  • `SELECT DENSE_RANK() OVER(ORDER BY title) AS pk_id` - This query is incomplete. Please add the execution plans for both queries to the question. – The Impaler Mar 03 '23 at 16:12
  • I would use row_number() so every row gets a different number, dense_rank is capable of returning the same number for 2 or more rows. Also you can order by any combination of columns , not sure you gain much by using md5 – Paul Maxwell Mar 04 '23 at 04:30

0 Answers0