Select rows with most similar set of attributes

Question

I have a PostgreSQL 8.3.4 DB to keep information about photo taggings.

First off, my table definitions:

create table photos (
   id integer
 , user_id integer
 , primary key (id, user_id)
); 

create table tags (
   photo_id integer
 , user_id integer
 , tag text
 , primary key (user_id, photo_id, tag)
);

What I'm trying to do + simple example:

I am trying to return all the photos that have at least k other photos with at least j common tags.

I. e., if Photo X has these tags (info field in the tags table):

gold 
clock
family

And photo Y has the next tags:

gold
sun 
family
flower

X and Y have 2 tags in common. For k = 1 and j = 2 X and Y will be returned.

What I have tried

    SELECT tags1.user_id , users.name, tags1.photo_id
    FROM users, tags tags1, tags tags2
    WHERE ((tags1.info = tags2.info) AND (tags1.photo_id != tags2.photo_id)
    AND (users.id = tags1.user_id))
    GROUP BY tags1.user_id, tags1.photo_id, tags2.user_id, tags2.photo_id, users.name
    HAVING ((count(tags1.info) = <j>) and (count(*) >= <k>))
    ORDER BY user_id asc, photo_id asc

My failed results:

When I tried to run it on those tables:

photos

photo_id       user_id
   0             0
   1             0
   2             0
   20            1
   23            1
   10            3

tags

photo_id     user_id       tag
   0           0            Car
   0           0            Bridge
   0           0            Sky
   20          1            Car
   20          1            Bridge
   10          3            Sky

The result for k = 1 and j = 1:
Expected:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |
| 1            | Ben          | 20           |
| 3            | Lev          | 10           |

Actual:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |
| 3            | Lev          | 10           |

For k = 2 and j = 1:
Expected:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |

Actual: empty result.

For j = 2 and k = 2:
Expected: empty result.

Actual:

|   user_id    |  User Name   |   Photo ID   |
| 0            | Bob          | 0            |
| 1            | Ben          | 20           |

How to solve this properly?

For `k = 1, j = 1` your expected output contradicts your definition — for photos `0` and `20` has 2 common tags. The same for `k=2, j=2`. Perhaps you should revise your conditions. — vyegorov, Dec 20 '15 at 22:12
Your table definition is not an actual table definition. A complete `CREATE TABLE` script or what you get with `\d tbl` in psql would be *much* more useful. And *always* your version of Postgres. And clarify table and column names, please. `photos` or `users`? `User ID ` or `user_id`? Ambiguity is not helping. — Erwin Brandstetter, Dec 21 '15 at 01:20
I have edited my question and now it meets the advised requirements — wannabe programmer, Dec 21 '15 at 09:16
Are you serious about `PostgreSQL 8.3.4`? [That's just too old and unsupported now.](http://www.postgresql.org/support/versioning/) Consider upgrading to a current version. — Erwin Brandstetter, Dec 21 '15 at 12:53
I took the liberty to roll back your last edit, assuming you removed the essential table definitions by accident ... — Erwin Brandstetter, Dec 21 '15 at 14:16

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Working with your current design, this uses only basic SQL features and should work for Postgres 8.3, too (untested):

SELECT *
FROM   photos p
WHERE (
   SELECT count(*) >= 1  -- k other photos
   FROM  (
      SELECT 1
      FROM   tags t1
      JOIN   tags t2 USING (tag)
      WHERE  t1.photo_id =  p.id
      AND    t1.user_id  =  p.user_id
      AND   (t2.photo_id <> p.id OR
             t2.user_id  <> p.user_id)
      GROUP  BY t2.photo_id, t2.user_id
      HAVING count(*) >= 1  -- j common tags
      ) t1
   );

Or:

SELECT *
FROM  (
   SELECT id, user_id
   FROM  (
      SELECT t1.photo_id AS id, t1.user_id
      FROM   tags t1
      JOIN   tags t2 USING (tag)
      WHERE (t2.photo_id <> t1.photo_id OR
             t2.user_id  <> t1.user_id)
      GROUP  BY t1.photo_id, t1.user_id, t2.photo_id, t2.user_id
      HAVING count(*) >= 1    -- j common tags
      ) sub1
   GROUP  BY 1, 2
   HAVING count(*) >= 1  -- k other photos
   ) sub2
JOIN   photos p USING (id, user_id);

In Postgres 9.3 or later you could use a correlated subquery with a LATERAL join ... The above are probably even faster than my first query:

SELECT *
FROM  (
   SELECT photo_id, user_id
   FROM   tags t
   GROUP  BY 1, 2
   HAVING (
      SELECT count(*) >= 1
      FROM  (
         SELECT photo_id, user_id
         FROM   tags
         WHERE  tag = ANY(array_agg(t.tag))
         AND   (photo_id <> t.photo_id OR
                user_id  <> t.user_id)
         GROUP  BY 1, 2
         HAVING count(*) >= 2
         ) t1
      )
   ) t
JOIN photos p ON p.id = t.photo_id
             AND p.user_id = t.user_id;

SQL Fiddle showing both on Postgres 9.3.

The 1st query just needs the right basic indexes.

For the 2nd, I would build a materialized view with integer arrays, install the intarray module, a GIN index on the integer array column for better performance ...
Related:

Order result by count of common array elements

Proper design

It would be much more efficient to have a single column serial PK for photos and only store IDs of tags per photo ...:

CREATE TABLE  photo (
   photo_id serial PRIMARY KEY
 , user_id  int NOT NULL
); 

CREATE TABLE  tag (
   tag_id serial PRIMARY KEY 
 , tag    text UNIQUE NOT NULL
);

CREATE TABLE  photo_tag (
   photo_id int REFERENCES (photo)
 , tag_id   int REFERENCES (tag)
 , PRIMARY KEY (photo_id, tag_id)
);

Would make the query much simpler and faster, too.

How to implement a many-to-many relationship in PostgreSQL?

Thanks, a there an alternative to array_agg? My db doesn't support it. I tried changing it to select tag from tags but it gave me a wrong result for some reason — wannabe programmer, Dec 21 '15 at 12:55
@wannabeprogrammer: I added a solution for pg 8.3 and some more advice. — Erwin Brandstetter, Dec 21 '15 at 13:50
@wannabeprogrammer: Note the fixed bug: Since your update revealed a multicolumn PK, we need to group by all columns of the PK / FK: `GROUP BY t2.photo_id, t2.user_id`. Plus, I added another alternative. — Erwin Brandstetter, Dec 21 '15 at 14:39

score 0 · Answer 2 · answered Dec 20 '15 at 21:25

If I understood you correctly, you want to calculate similarity between all photos of all users by common tags.

I think you need this:

SELECT 
    A.user_id,
    A.photo_id,
    B.user_id,
    B.photo_id,
    (
        SELECT COUNT(*) 
            FROM 
                tags TA 
                JOIN tags TB ON TA.tag = TB.tag
            WHERE   
                A.user_id = TA.user_id
                AND A.photo_id = TA.photo_id
                AND B.user_id = TB.user_id
                AND B.photo_id = TB.photo_id
    ) AS common_tags
FROM
    users A
    ,users B
WHERE
    -- Exclude results to self
    A.user_id <> B.User_id
    AND A.photo_id <> B.photo_id

Where is the j,k part? – wannabe programmer Dec 21 '15 at 05:57 — wannabe programmer, Dec 21 '15 at 05:57

Select rows with most similar set of attributes

2 Answers2

Proper design