Match on a basic form
For instance, replace all non-word characters with regexp_replace() and fold to lower case:
SELECT lower(regexp_replace(keyword, '\W', '', 'g')) AS folded_key
     , count(*) AS ct
FROM   tbl
GROUP  BY 1;
I use the class shorthand \W to identify all characters that are not word characters as defined by the ctype of your underlying locale. That should remove all insignificant characters. Folding to lower case is another (optional) step in the normalization. You may need to apply more or other string functions to arrive at the basic form you need.
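To see what the folding produces, here is a quick sanity check with made-up sample values (the values are assumptions, not from your data):

SELECT lower(regexp_replace(k, '\W', '', 'g')) AS folded_key
FROM   unnest('{Foo-Bar,"foo bar!",FOOBAR}'::text[]) AS k;
-- all three rows fold to 'foobar'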
unaccent()
Another useful method for "string normalization" is to remove diacritic signs from lexemes with unaccent(). Complete details here:
Does PostgreSQL support "accent insensitive" collations?
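A minimal sketch (unaccent ships as a contrib extension; the sample string, and the assumption of a UTF-8 locale where accented letters count as word characters, are mine):

CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT lower(unaccent(regexp_replace('Café-Crème!', '\W', '', 'g')));
-- 'cafecreme'  -- '\W' drops '-' and '!', unaccent() strips the accents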
Match to list of base keywords
In the long run, you should create a lookup table that lists all basic keywords and use a foreign key constraint from the main table. Think of database normalization.
CREATE TABLE keyword (
  keyword_id serial PRIMARY KEY
, keyword    text UNIQUE
);
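One way to seed the lookup from existing data, a sketch assuming the folding from above (ON CONFLICT requires Postgres 9.5+):

INSERT INTO keyword (keyword)
SELECT DISTINCT lower(regexp_replace(keyword, '\W', '', 'g'))
FROM   tbl
ON     CONFLICT (keyword) DO NOTHING;  -- makes reruns harmless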
Such a lookup table would be useful in any case, even without referential integrity (foreign key constraints). Once the basic terms are defined, you can group by similarity, which will give better results than the "brute force" method above. It's not a stab in the dark anymore. Now Postgres knows where to group similar entries together.
There are a couple of possibly useful operators / functions in the additional module fuzzystrmatch. But my first choice would be the similarity operator % supplied by the module pg_trgm. With these operators you can even tolerate typos and syntax variations. The query could look like this:
-- SELECT set_limit(0.7); -- optional; see below.
SELECT k.keyword, count(t.keyword) AS ct  -- counts matches; 0 for unmatched keywords
FROM   keyword k
LEFT   JOIN tbl t ON k.keyword % t.keyword
GROUP  BY 1;
You can calibrate the tolerance for matches with the supplied function set_limit().
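For orientation, a small self-contained demo of the module and the threshold:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT similarity('dog', 'dogs');  -- 0.5
SELECT set_limit(0.7);             -- raise the threshold for %
SELECT 'dog' % 'dogs';             -- false now; true at the default limit of 0.3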
The LEFT JOIN includes keywords without a match (ct = 0 when counting t.keyword as above). You may or may not want that. Replace it with JOIN if you don't.
The cherry on top: the similarity operator % can utilize a GIN or GiST index, which makes this much faster for big tables. Details in this related answer:
PostgreSQL LIKE query performance variations
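A sketch of such an index on the raw column (the index name is my invention; GIN is typically the better choice for the % operator):

CREATE INDEX tbl_keyword_trgm_idx ON tbl USING gin (keyword gin_trgm_ops);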
Combine both
If that's still not accurate enough, you can combine both methods: similarity to a "normalized" string:
SELECT k.keyword, count(t.keyword) AS ct
FROM   keyword k
LEFT   JOIN tbl t ON k.keyword % lower(regexp_replace(t.keyword, '\W', '', 'g'))
GROUP  BY 1;
You can even support that with a functional GiST index. Example here:
Postgresql: Matching Patterns between Two Columns
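A sketch of such an expression index (the name is an assumption; this works because lower() and regexp_replace() are IMMUTABLE):

CREATE INDEX tbl_folded_keyword_trgm_idx ON tbl
USING gist (lower(regexp_replace(keyword, '\W', '', 'g')) gist_trgm_ops);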