The most effective "hash"-function of all is a serial
primary key - giving you a unique number like you wished for in the question.
I also deal with duplicates in this demo:
CREATE TEMP TABLE string (
string_id serial PRIMARY KEY
,string text NOT NULL UNIQUE -- no dupes
,ct int NOT NULL DEFAULT 1 -- count instead of dupe rows
);
Then you would enter new strings like this:
(Data-modifying CTE requires PostgreSQL 9.1 or later.)
WITH x AS (SELECT 'abc'::text AS nu)
, y AS (
UPDATE string s
SET ct = ct + 1
FROM x
WHERE s.string = x.nu
RETURNING TRUE
)
INSERT INTO string (string)
SELECT nu
FROM x
WHERE NOT EXISTS (SELECT 1 FROM y);
If the string nu
already exists, the count (ct
) is increased by 1. If not, a new row is inserted, starting with a count of 1.
The UNIQUE
also adds an index on the column string.string
automatically, which leads to optimal performance for this query.
Add additional logic (triggers ?) for UPDATE
/ DELETE
to make this bullet-proof - if needed.
Note, there is a super-tiny race condition here, if two concurrent transactions try to add the same string at the same moment in time. To be absolutely sure, you could use SERIALIZABLE
transactions. More info and links under this this related question.