I'm using DENSE_RANK() to produce PRIMARY KEY candidate columns from other TEXT columns.
E.g.
SELECT DENSE_RANK() OVER(ORDER BY title) AS pk_id;
I'm aware than sorting text is slower than sorting integers, so I had the idea to convert the text to an integer format before ranking. (following advice from: Hashing a String to a Numeric Value in PostgreSQL)
My method currently uses a temporary table, as I'm trying to ensure that md5(...)
only runs once per row, and I can't be sure of that if I simply put the md5(...)
call within the OVER(...)
part. What I do instead is: 1) create a temporary table using md5(...)
, then 2) I run DENSE_RANK()
over this other table instead, thus creating my primary key column.
But I thought I'd ask here, to know for sure: Would it actually be faster to use md5(...)
within the OVER(...)
parentheses? I.e.
SELECT DENSE_RANK() OVER(ORDER BY 'x'||substr(md5($1),1,16))::bit(64)::bigint) AS pk_id;
`
My question can also be worded as follows: Would the Postgres query optimiser be good enough to recognise that the "to bigint" conversion is a static procedure? In particular, I'm worried that the md5(...)
conversion would run once per comparison (O(n log n)), rather than once per row (O(n).
PS: One or two hash collisions are okay for my use case.