2

I'm relatively new to sql and have been trying to get a pretty complex query (for me) to work for a while, but I've kept on getting duplicate values within each column using node postgres. Through this query I am trying to track user info, plan info and email info on a dashboard. Before we get to the query, here are the tables -

USER TABLE (u) - keep track of user info

+----+-------+---------+-------------+----------+
| id | first |  last   |    email    | password | 
+----+-------+---------+-------------+----------+
|  1 | joe   | smith   | j@gmail.com |     1234 | 
|  2 | mary  | johnson | m@gmail.com |     3445 | 
|  3 | harry | gold    | h@gmail.com |     4345 | 
+----+-------+---------+-------------+----------+

PLANS TABLE (p) - plans that users can make with their friends

+----+--------------+-----------+------------+------+--------+-----------+---------+------+
| id | experienceid | hostid(u) | guestid(u) | date |  time  | paidid(u) | groupid | newp |
+----+--------------+-----------+------------+------+--------+-----------+---------+------+
| 33 |            1 |         1 | [1,2,3]    | 4/20 | 8:00pm | [1,2]     |       1 | true |
+----+--------------+-----------+------------+------+--------+-----------+---------+------+

EMAIL TABLE (e) - keep track of messages I am sending to the users based on the plan they are a part of

+-------------+-----------+---------+----------+
|  email(u)   | planid(p) | confirm | reminder |
+-------------+-----------+---------+----------+
| j@gmail.com |        33 | null    | null     |
| m@gmail.com |        33 | true    | false    |
| h@gmail.com |        33 | true    | false    |
+-------------+-----------+---------+----------+

Now for the query what I am trying to do is combine all three tables to get this output -

+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+
| id(p) | hostname(u+p) |      paidguests(u+p)      | time(p) | newp(p) | groupid(p) |  reminder(e)  |  confirm(e)   |
+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+
|    33 | joe smith     | [joe smith, mary johnson] | 8:00pm  | true    |          1 | [true, false] | [true, false] |
+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+

Now where I left off in the query, I almost got it to work, but I kept getting duplicate values where it would look like this -

+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+
| id(p) | hostname(u+p) |                  paidguests(u+p)                   | time(p) | newp(p) | groupid(p) |        reminder(e)         |        confirm(e)         |
+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+
|    33 | joe smith     | [joe smith, mary johnson, joe smith, mary johnson] | 8:00pm  | true    |          1 | [true, false, true, false] | [true, false, true false] |
+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+ 

Now I don't really care about order for the confirm and reminder columns relative to the paidguests(u+p) column, as long as the right data is in there and not duplicated. Here is the query I had as it stands -

SELECT p.id, 
       Concat(u.first, ' ', u.last)              AS hostname, 
       Array_agg(Concat(us.first, ' ', us.last)) AS paidguests, 
       p.time, 
       p.groupid, 
       p.newp, 
       Array_agg(e.confirm)                      AS confirm, 
       Array_agg(e.reminder)                     AS reminder 
FROM   plans p 
       CROSS JOIN Unnest(p.paidid) AS allguests 
       LEFT JOIN users us 
              ON allguests = us.id 
       LEFT JOIN emails e 
              ON p.id = e.planid 
       LEFT JOIN users u 
              ON p.hostid = u.id 
WHERE  p.experienceid = $1 
       AND p.date = $2 
GROUP  BY p.id, 
          u.first, 
          u.last, 
          p.paidid, 
          p.time, 
          p.groupid, 
          p.newp, 
          confirm, 
          reminder 
ORDER  BY Array_length(p.paidid, 1) DESC 

So essentially just looking to get the table right without the duplicates. It was working before I added the join to the email table, but not entirely sure why its duplicating.

Hope I was thorough in the explanation. If not, please let me know what I can clarify! Thanks so much :)

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
hasaab
  • 63
  • 1
  • 5

2 Answers2

3

try adding this condition in where clause:

AND us.email = e.email

Culprit here is -> all emails are joined with all the users irrespective of their email id since all users and all emails have same plan_id. Hence the duplication.

Rahul Jain
  • 1,319
  • 7
  • 16
0

Rahul spottet the missing join condition. But the rabbit hole goes deeper. I suggest this query:

SELECT p.id
     , concat_ws(' ', u.first, u.last) AS hostname  -- concat_ws!
     , p.time
     , p.groupid
     , p.newp
     , paid.paidguests
     , paid.confirm
     , paid.reminder
FROM   plans      p
LEFT   JOIN users u  ON u.id = p.hostid
LEFT   JOIN LATERAL (         -- LATERAL join
   SELECT array_agg(sub.paidguest) AS paidguests
        , array_agg(sub.confirm)   AS confirm
        , array_agg(sub.reminder)  AS reminder
   FROM  (
      SELECT concat_ws(' ', us.first, us.last) AS paidguest, e.confirm, e.reminder
      FROM   unnest(p.paidid) WITH ORDINALITY AS paid(id, ord)
      JOIN   users       us ON us.id = paid.id
      LEFT   JOIN emails e  ON e.email = us.email
                           AND e.planid = p.planid
      ORDER  BY paid.ord
      ) sub
   ) paid ON true
WHERE  p.experienceid = $1
AND    p.date = $2
-- no GROUP  BY needed
ORDER  BY cardinality(p.paidid) DESC, p.id;

Assuming that (planid, email) is the PRIMARY KEY of the email table and there is a FOREIGN KEY constraint from email to plan.email.

Major points

  • Aggregate first, then join, so you don't need to GROUP BY on all the columns that don't need aggregation. While retrieving all or most rows, other query techniques are typically faster, for a small selection like in your example, I suggest a LATERAL join. Related:

    In this particular case JOIN LATERAL would be equivalent to LEFT JOIN LATERAL since the subuery with aggregates always returns exactly 1 row.

  • The alias in your original Unnest(p.paidid) AS allguests is actively confusing, since those seem to be the IDs of guests who paid, not of all guests.

  • Use concat_ws() if first or last can be NULL. See:

  • When unnesting arrays, the order of elements is typically preserved in the simple case. But you have additional joins, so you should use WITH ORDINALITY and explicit ORDER BY to avoid surprises. Your query may seem to work alright, even for a prolonged period of time - and then "suddenly" seem to break (wrong order of elements) if you are not explicit about this.

Your whole DB design is debatable. Typically, arrays are an anti-pattern for design and should rather be implemented as related tables - for many reasons, beyond the scope of this question.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228