
I'm wondering how to solve the following problem in a good way. We have a table with thousands of records that correspond to specific keywords. Because of a custom DSL in our app, we end up with several "similar" values like:

office equipment
[office equipment]
+office equipment
office -equipment

etc.

What I need to do is group all of those values and present a single summarized one, something like "office equipment 4" for the example above.

I could do it in code by using some comparison algorithm, but that would mean retrieving thousands of objects into memory. Is there a way to accomplish this using SQL?

Pablo

2 Answers


Match on a basic form

For instance, replace all non-word characters with regexp_replace() and fold to lower case:

SELECT lower(regexp_replace(keyword, '\W', '', 'g')) AS folded_key
     , count(*) AS ct
FROM   tbl
GROUP  BY 1;

I use the class shorthand \W to match all characters that are not word characters as defined by the ctype setting of your underlying locale. That should remove all insignificant characters. Folding to lower case is another (optional) normalization step. You may need to apply more/other string functions to get the basic form you need.
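To illustrate with the sample values from the question (the VALUES list is just for demonstration):

SELECT lower(regexp_replace(v, '\W', '', 'g')) AS folded_key
FROM  (VALUES ('office equipment')
            , ('[office equipment]')
            , ('+office equipment')
            , ('office -equipment')) t(v);

All four rows fold to 'officeequipment', so grouping on the expression collapses them into one group.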

unaccent()

Another useful method for "string normalization" is to remove diacritic signs from lexemes with unaccent(). Complete details here:
Does PostgreSQL support "accent insensitive" collations?
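A minimal sketch, assuming you have the privileges to install extensions (unaccent ships as an additional module and has to be installed once per database):

CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT unaccent('Résumé');  -- returns 'Resume'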

Match to list of base keywords

In the long run, you should create a lookup table that lists all basic keywords and use a foreign key constraint from the main table. Think of database normalization.

CREATE TABLE keyword (
  keyword_id serial PRIMARY KEY
, keyword    text UNIQUE
);

Such a lookup table would be useful in any case, even without referential integrity (foreign key constraints). Once the basic terms are defined, you can group by similarity, which will give better results than the "brute force" method above. It's not a stab in the dark anymore. Now Postgres knows where to group similar entries together.
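One way to seed such a lookup table, a sketch assuming the folded form from above is acceptable as the basic keyword:

INSERT INTO keyword (keyword)
SELECT DISTINCT lower(regexp_replace(keyword, '\W', '', 'g'))
FROM   tbl;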

There are a couple of possibly useful operators / functions in the additional module fuzzystrmatch. But my first choice would be the similarity operator % supplied by the module pg_trgm. With these operators you can even tolerate typos and syntax variations. The query could look like this:

-- SELECT set_limit(0.7); -- optional; see below.
SELECT k.keyword, count(*) AS ct
FROM   keyword_tbl k
LEFT   JOIN tbl    t ON k.keyword % t.keyword
GROUP  BY 1;

You can calibrate the tolerance for matches with the supplied function set_limit().
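pg_trgm is an additional module as well; a minimal setup sketch (the threshold 0.7 is just an example value):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT set_limit(0.7);  -- default threshold is 0.3
SELECT show_limit();    -- verify the current threshold
SELECT similarity('office equipment', 'ofice equipment');  -- value between 0 and 1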

The LEFT JOIN includes keywords without match. You may or may not want that. Replace with JOIN if you don't.

The cherry on top: The similarity operator % can utilize a GIN or GiST index, which makes this much faster for big tables. Details in this related answer:
PostgreSQL LIKE query performance variations
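A sketch of such an index (the operator class gin_trgm_ops is provided by pg_trgm; the index name is made up):

CREATE INDEX tbl_keyword_trgm_idx ON tbl USING gin (keyword gin_trgm_ops);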

Combine both

If that's still not accurate enough, you can combine both methods: similarity to a "normalized" string:

SELECT k.keyword, count(*) AS ct
FROM   keyword_tbl k
LEFT   JOIN tbl    t ON k.keyword
                      % lower(regexp_replace(t.keyword, '\W', '', 'g'))
GROUP  BY 1;

You can even support that with a functional GiST index. Example here:
Postgresql: Matching Patterns between Two Columns
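A sketch of that functional index; the indexed expression must match the join condition exactly for the planner to use it (index name assumed):

CREATE INDEX tbl_keyword_folded_trgm_idx ON tbl
USING gist (lower(regexp_replace(keyword, '\W', '', 'g')) gist_trgm_ops);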

Erwin Brandstetter
  • This is a good solution if you do not want to do any similarity-checks (phonetics, alignments, etc). – AKDADEVIL Feb 07 '14 at 16:26
  • @AKDADEVIL: I expanded on similarity. The specific difficulty: as long as basic forms of keywords are unknown it is hard to form similarity groups by stabbing in the dark ... – Erwin Brandstetter Feb 07 '14 at 17:29

Well, for the examples you give:

select translate(keyword, '[]+-()', '') as newval, count(*)
from tbl t
group by 1;

translate() just removes the characters that don't seem to belong.
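For instance, with one of the sample values (literal taken from the question):

SELECT translate('[office equipment]', '[]+-()', '');  -- returns 'office equipment'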

Gordon Linoff