Find spectators that have seen the same shows (match multiple rows for each)

Question

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked with last query. The database models a reservation system for an opera house. The query is about associating the a spectator the other spectators that assist to the same events every time.

The model looks like this:

Reservations table 
id_res |     create_date     |  tickets_presented  | id_show | id_spectator | price | category 
-------+---------------------+---------------------+---------+--------------+-------+----------
     1 | 2015-08-05 17:45:03 |                     |       1 |            1 |   195 |        1
     2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 |      11 |            1 |   150 |        2

Spectators table

id_spectator   | last_name  | first_name |                email                   |     create_time     | age 
---------------+------------+------------+----------------------------------------+---------------------+-----   
             1 | gonzalez   | colin      | colin.gonzalez@gmail.com               | 2014-03-15 14:21:30 |  22
             2 | bequet     | camille    | bequet.camille@gmail.com               | 2014-12-10 15:22:31 |  22

Shows table
 id_show |          name          |  kind  | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3 
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
       1 | madama butterfly       | opera  | 2015-09-05        | 19:30:00   | 21:30:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100
       2 | don giovanni           | opera  | 2015-09-12        | 19:30:00   | 21:45:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100

So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.

SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;

Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.

So the result I'm expecting should be something like this

id_spectator | other_id_spectators
-------------+--------------------
            1|                 2,3

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

Really? You can't figure out which table? There are only 3. Which table has ids of spectators and shows? — Hogan, May 01 '15 at 13:48
@Hogan, ok yes I figured we can get the id_spectators and id_shows from the Reservations table but I can't understand how would it lead me to the final answer. — Colin G.D., May 01 '15 at 13:51
I have a policy not to answer a question which is tagged with more than one platform since sql can vary between platforms I'm certain to not know what code the user needs which are you using sql-server or postgresql? — Hogan, May 01 '15 at 13:54
@Hogan, I miss-tagged sql-server, I'm using PostgreSQL only as pointed out in the beginning of the post. — Colin G.D., May 01 '15 at 13:59
Why is this tagged "SQL-Server" if you using Postgres? And what is your version of Postgres (always provide this). Click edit (left under your question) to improve your question. — Erwin Brandstetter, May 01 '15 at 13:59
psql is the name of the client program and its version does not have to be the same as the server - `SELECT version()`. — Erwin Brandstetter, May 01 '15 at 14:02
Here is hint number 2 -- with this version you can use `STRING_AGG` — Hogan, May 01 '15 at 14:07
The expected result contradicts the demo data. There is no spectator 3 and the only reservations made so far are for spectator 1. This should really make sense *together* to be of use to the general public. Also; you did not properly define what you are looking for. AFAIU, this is a problem of "relational division" and is not as trivial as some comments would have you believe. — Erwin Brandstetter, May 01 '15 at 14:26
@Erwin, I understand what you're. I only printed the first two lines of the the tables that's why there's missing data. I didn't want to make a long post, and I believed that the example I gave for the expected result was enough to really understand the point of the query, because I'm not entirely sure I understand the question, the rules aren't very clear, so I can't be very clear either. — Colin G.D., May 01 '15 at 14:35

JMariña · Answer 1 · 2015-05-04T14:20:18.310

3

Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)

There is probably a better way to do it, but you could do it with the 'stuff 'function. The only drawback here is that, since your ids are ints, placing a comma between values will involve a work around (would need to be a string). Below is the method I can think of using a work around.

SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]

This will show you all other spectators that attended the same shows.

edited May 04 '15 at 14:20

answered May 01 '15 at 14:42

JMariña

324
2
15

After thinking this through, making a temporary tables wasn't needed. could just use the cast function, updated accordingly. – JMariña May 01 '15 at 15:10
I thought of it, I'm trying your proposition, seems to do the work. Thanks! – Colin G.D. May 01 '15 at 15:26
So instead I can use the string agg function – Colin G.D. May 01 '15 at 15:33
This is not even close to anything that would work in Postgres. `stuff` is not valid in Postgres, neither is `XML PATH` or the non-standard square brackets instead of standard double quotes etc. What's more, it's not working in SQL Server either: http://sqlfiddle.com/#!6/33dc7/1 – Erwin Brandstetter May 01 '15 at 23:05
When the question was posted it had a sql tag so it was answered in that manner. I ran the queries in actual SQL after staging data and it worked, I did use not sql fiddle. Have you tried running it in sql? Is it possible that there is a discrepancy in sqlfiddle vs MS SQL? – JMariña May 02 '15 at 00:15
@JMariña: My link is to an sqlfiddle for *SQL Server*: http://sqlfiddle.com/#!6/33dc7/1. And there is still an SQL tag. You seem to be under the impression that SQL would be connected to MS SQL Server? – Erwin Brandstetter May 02 '15 at 11:46
But you are right, there was an "SQL Server" tag. So I removed the downvote. – Erwin Brandstetter May 02 '15 at 12:19

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)

This is a special case of relational division. We have assembled an arsenal of basic techniques here:

How to filter SQL results in a has-many-through relation

It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.

Assuming that (d_spectator, id_show) is unique in reservations, which has not been clarified.

A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.

1. Brute force

The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:

SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM  (
   SELECT id_spectator
   FROM   reservations r
   WHERE  id_spectator <> 1
   GROUP  BY 1
   HAVING        array_agg(id_show ORDER BY id_show)
      @> (SELECT array_agg(id_show ORDER BY id_show)
          FROM   reservations
          WHERE  id_spectator = 1)
   ) sub;

But this is potentially very expensive for big tables. The whole table hast to be processes, and in a rather expensive way, too.

2. Smarter

Use a CTE to determine relevant shows, then only consider those

WITH shows AS (             -- all shows of id 1; 1 row per show
   SELECT id_spectator, id_show
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   )
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM  (
   SELECT s.id_spectator, r.id_spectator AS other
   FROM   shows s
   JOIN   reservations r USING (id_show)
   WHERE  r.id_spectator <> s.id_spectator
   GROUP  BY 1,2
   HAVING count(*) = (SELECT count(*) FROM shows)
   ) sub
GROUP  BY 1;

@> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.

Faster than 1. because only relevant shows are considered.

3. Real smart

To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:

WITH RECURSIVE shows AS (   -- produces exactly 1 row
   SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   GROUP  BY 1
   )
, cte AS (
   SELECT r.id_spectator, 1 AS idx
   FROM   shows s
   JOIN   reservations r ON r.id_show = s.shows[1]
   WHERE  r.id_spectator <> s.id_spectator

   UNION  ALL
   SELECT r.id_spectator, idx + 1
   FROM   cte c
   JOIN   reservations r USING (id_spectator)
   JOIN   shows s ON s.shows[c.idx + 1] = r.id_show
   )
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM   shows s
JOIN   cte c ON c.idx = s.ct  -- has an entry for every show
GROUP  BY 1;

Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).

This should be fastest for small selections from big tables. Row that don't qualify are excluded early. the two indices I mentioned are essential.

SQL Fiddle demonstrating all three.

@ErwinBrandsetter Oh and you're right (id_spectator, id_show) is unique in each reservation — Colin G.D., May 02 '15 at 11:44

score 0 · Answer 3 · answered May 01 '15 at 13:30

0

It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.

What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.

answered May 01 '15 at 13:30

Jeremy Fortune

2,459
1
18
21

I don't think this is very helpful since this is still only describes about half the solution – Hogan May 01 '15 at 13:53

score 0 · Answer 4 · answered May 01 '15 at 15:54

So the final answer I got, looks like this :

SELECT id_spectator, id_show,(
    SELECT string_agg(to_char(A.id_spectator, '999'), ',')
    FROM Reservations A
    WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;

Which prints something like this:

id_spectator | id_show | other_id_spectators 
-------------+---------+---------------------
           1 |       1 |    1,   2,   9
           1 |      14 |    1,   2

Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

Find spectators that have seen the same shows (match multiple rows for each)

4 Answers4

1. Brute force

2. Smarter

3. Real smart