I'm really bad at SQL and I would like to know what SQL I can run to solve the problem below, which I suspect is NP-complete. I'm OK with the query taking a long time to run over large datasets, as this will be done as a background task. A standard SQL statement is preferred, but if a stored procedure is required then so be it. The SQL needs to run on Postgres 9.3.

Problem: Given a set of articles that each contain a set of keywords, for each article find the top n other articles that share the most keywords with it.

A trimmed down version of the article table looks like this:

CREATE TABLE article (
  id character varying(36) NOT NULL,  -- primary key of article
  keywords character varying,         -- comma separated set of keywords

  CONSTRAINT pk_article PRIMARY KEY (id)
);

-- Test Data
INSERT INTO article(id, keywords) VALUES(0, 'red,green,blue');
INSERT INTO article(id, keywords) VALUES(1, 'red,green,yellow');
INSERT INTO article(id, keywords) VALUES(2, 'purple,orange,blue');
INSERT INTO article(id, keywords) VALUES(3, 'lime,violet,ruby,teal');
INSERT INTO article(id, keywords) VALUES(4, 'red,green,blue,yellow');
INSERT INTO article(id, keywords) VALUES(5, 'yellow,brown,black');
INSERT INTO article(id, keywords) VALUES(6, 'black,white,blue');

Which would result in this for a SELECT * FROM article; query:

Table: article
------------------------
id  keywords            
------------------------
0   red,green,blue      
1   red,green,yellow    
2   purple,orange,blue  
3   lime,violet,ruby,teal
4   red,green,blue,yellow
5   yellow,brown,black
6   black,white,blue

Assuming I want the top 3 articles with the most matching keywords for each article, the output should be this:

------------------------
id  related
------------------------
0   4,1,6
1   4,0,5
2   0,4,6
3   null
4   0,1,6
5   1,6
6   5,0,4
Shane Rowatt
  • You should **never** store comma separated values in a single column. If you normalize your model, the query gets really easy. – a_horse_with_no_name Jul 20 '15 at 09:06
  • I'm ok with splitting the keywords into their own table if required. This was just a result of my laziness to get this working. – Shane Rowatt Jul 20 '15 at 09:20
  • You should. You are also limiting your keywords; what if you have a keyword with a really long name? There will be a big performance increase. – Jack Pettinger Jul 20 '15 at 09:21
  • @JackPettinger I use a set of 10,000 commonly used english words to extract the keywords from the article content so the keyword are on average around 6 characters (as I exclude pronouns). – Shane Rowatt Jul 20 '15 at 09:29
  • @wingedpanther 0 | 4,1,6 because row 4 has 3 keywords in common with row 0 (red,green,blue), row 1 has 2 in common (red,green), and row 6 has 1 in common (blue). Does that make sense? – Shane Rowatt Jul 20 '15 at 09:33
  • `0 | 4,1,6` could as well be `0 | 4,1,2`, because 2 has 1 key in common as well. The gist of it: you need to define a **tiebreaker**. And `5 | 1,6` is wrong. Must be `5 | 1,4,6` or something. – Erwin Brandstetter Jul 20 '15 at 09:48

2 Answers


Like @a_horse commented: This would be simpler with a normalized design (besides making other tasks simpler/cleaner), but it's still not trivial.

Also, a PK column of data type character varying(36) is highly suspicious (and inefficient) and should most probably be an integer type or at least a uuid instead.
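
For example (just a sketch; pick one of the two id definitions, and use uuid only if the 36-character strings really are UUIDs):

-- Sketch: the same table with a more efficient primary key
CREATE TABLE article (
  id       serial PRIMARY KEY,   -- or:  id uuid PRIMARY KEY
  keywords character varying     -- comma separated set of keywords
);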

Here is one possible solution based on your design as is:

WITH cte AS (                        -- split the comma-separated string into an array once
   SELECT id, string_to_array(a.keywords, ',') AS keys
   FROM   article a
   )
SELECT id, string_agg(b_id, ',') AS best_matches
FROM  (
   SELECT a.id, b.id AS b_id
        , row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
   FROM   cte a
   LEFT   JOIN cte b ON a.id <> b.id AND a.keys && b.keys   -- only pairs sharing at least one keyword
   LEFT   JOIN LATERAL (             -- count keywords common to both arrays
      SELECT count(*) AS ct
      FROM  (
         SELECT * FROM unnest(a.keys)
         INTERSECT ALL
         SELECT * FROM unnest(b.keys)
         ) i
      ) ct ON TRUE
   ORDER  BY a.id, ct.ct DESC, b.id  -- b.id as tiebreaker
   ) sub
WHERE  rn < 4                        -- keep the top 3 matches per article
GROUP  BY 1;

SQL Fiddle (using an integer id instead).

The CTE cte converts the comma-separated string into an array. You could even create a functional GIN index on that expression ...
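
A sketch of such an index (the index name is my own; it works because string_to_array(text, text) is immutable):

-- Functional GIN index on the derived array.
-- Postgres can only use it when a query repeats the same expression, e.g.
-- string_to_array(a.keywords, ',') && string_to_array(b.keywords, ',')
CREATE INDEX article_keywords_gin_idx ON article
USING  gin (string_to_array(keywords, ','));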

If multiple rows tie for the top 3 picks, you need to define a tiebreaker. In my example, rows with smaller id come first.

Detailed explanation in this recent related answer:

The comparison there is between a JSON array and an SQL array, but it's basically the same problem and boils down to the same solution(s). It also compares a couple of similar alternatives.

To make this fast, you should at least have a GIN index on an array column (instead of the comma-separated string), and then the query wouldn't need the CTE step. A completely normalized design has other advantages, but won't necessarily be faster than an array with a GIN index.
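
A minimal sketch of that migration, assuming every keyword list is strictly comma-separated (column and index names are my own):

-- Replace the comma-separated string with a native array column
ALTER TABLE article ADD COLUMN keys text[];
UPDATE article SET keys = string_to_array(keywords, ',');
ALTER TABLE article DROP COLUMN keywords;

-- Plain GIN index on the array column; the join condition is then simply a.keys && b.keys
CREATE INDEX article_keys_idx ON article USING gin (keys);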

Erwin Brandstetter
  • Oh my, I'm not even going to claim I understand that query but I will give it a try. – Shane Rowatt Jul 20 '15 at 09:34
  • @ShaneRowatt: Note the fixed type: `a.keys`, not `b.keys` a 2nd time. – Erwin Brandstetter Jul 20 '15 at 09:46
  • Postgres is returning an error at the first select inside the lateral join. ERROR: syntax error at or near "SELECT" LINE 12: SELECT count(*) AS ct ^ ********** Error ********** ERROR: syntax error at or near "SELECT" SQL state: 42601 Character: 366 – Shane Rowatt Jul 20 '15 at 10:42
  • @ShaneRowatt: The added SQL fiddle demonstrates that it works for pg 9.3. Note the slight difference, working with integer id in the fiddle. – Erwin Brandstetter Jul 20 '15 at 10:47
  • Ok, I've got it working now. Thanks a lot. The `JOIN LATERAL` is something new for me and looks very handy. – Shane Rowatt Jul 20 '15 at 10:56
  • For a single resulting column you can easily replace the LATERAL query with a correlated subquery: http://stackoverflow.com/a/28557803/939860 – Erwin Brandstetter Jul 20 '15 at 11:01

You can store lists in comma-separated strings. No problem, as long as this is just a string for you and you are not interested in its separate values. As soon as you are interested in the separate values, as in your example, store them separately.

That said, correct your database design first and only then think about the query.
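
The query below expects a normalized table with one row per article/keyword pair. The table and column names are taken from the query itself, but the exact DDL is my sketch; it can be filled from the existing article table:

-- Sketch of the assumed normalized design: one row per (article, keyword)
CREATE TABLE keywords (
  id      character varying(36) NOT NULL REFERENCES article (id),
  keyword character varying     NOT NULL,
  CONSTRAINT pk_keywords PRIMARY KEY (id, keyword)
);

-- Populate it by splitting the existing comma-separated strings
INSERT INTO keywords (id, keyword)
SELECT DISTINCT a.id, k
FROM   article a, unnest(string_to_array(a.keywords, ',')) AS k;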

The following query selects all ID pairs first and counts common keywords. It then ranks the pairs by giving the other ID with the most keywords in common rank #1, etc. Then you keep only the three best matching IDs. STRING_AGG lists the best matching IDs in a string ordered by the number of keywords in common.

select 
  this_article as id,
  string_agg(other_article, ',' order by rn) as related
from
(
  select 
    this_article, 
    other_article, 
    row_number() over (partition by this_article order by cnt_common desc) as rn  -- ties are broken arbitrarily; add other_article to the ORDER BY for a deterministic result
  from
  (
    select 
      this.id as this_article, 
      other.id as other_article, 
      count(other.id) as cnt_common
    from keywords this
    left join keywords other on other.keyword = this.keyword and other.id <> this.id
    group by this.id, other.id
  ) pairs
) ranked
where rn <= 3
group by this_article
order by this_article;

Here is the SQL fiddle: http://sqlfiddle.com/#!15/1d20c/9.

Thorsten Kettner