1

I have two tables with a many-to-many association in postgresql. The first table contains activities, which may count zero or more reasons:

CREATE TABLE activity (
   id integer NOT NULL,
   -- other fields removed for readability
);

CREATE TABLE reason (
   id varchar(1) NOT NULL,
   -- other fields here
);

For performing the association, a join table exists between those two tables:

CREATE TABLE activity_reason (
   activity_id integer NOT NULL, -- refers to activity.id
   reason_id varchar(1) NOT NULL, -- refers to reason.id
   CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
  CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);

I would like to count the possible association between activities and reasons. Supposing I have those records in the table activity_reason:

+--------------+------------+
| activity_id  |  reason_id |
+--------------+------------+
|           1  |          A |
|           1  |          B |
|           2  |          A |
|           2  |          B |
|           3  |          A |
|           4  |          C |
|           4  |          D |
|           4  |          E |
+--------------+------------+

I should have something like:

+-------+---+------+-------+
| count |   |      |       |
+-------+---+------+-------+
|     2 | A | B    | NULL  |
|     1 | A | NULL | NULL  |
|     1 | C | D    | E     |
+-------+---+------+-------+

Or, eventually, something like :

+-------+-------+
| count |       |
+-------+-------+
|     2 | A,B   |
|     1 | A     |
|     1 | C,D,E |
+-------+-------+

I can't find the SQL query to do this.

Julien Fastré
  • 1,008
  • 13
  • 20
  • What are you counting? I'm confused because your example result contains "C,D,E" in it, but your sample data doesn't have a "D" anywhere. Are you trying to get all of the reason_ids that are tied to an activity_id? – rdubya Apr 06 '16 at 00:48
  • `I remove the constraint for readability`. This is a misunderstanding. ***Never*** remove constraints, those are necessary for clarity. Assuming a typical implementation with a PK on `(activity_id, reason_id)`. – Erwin Brandstetter Apr 06 '16 at 02:40
  • @rdubya : I fixed the missing letters. – Julien Fastré Apr 06 '16 at 19:22
  • @erwin-brandstetter : I added the constraints. – Julien Fastré Apr 06 '16 at 19:22
  • [The typical implementation of a many-to-many association has a `UNIQUE` or `PRIMARY KEY` constraint.](http://stackoverflow.com/a/9790225/939860) This is essential to the solution. – Erwin Brandstetter Apr 06 '16 at 20:41

3 Answers3

2

I think you can get what you want using this query:

SELECT count(*) as count, reasons
FROM (
  SELECT activity_id, array_agg(reason_id) AS reasons
  FROM (
    SELECT A.activity_id, AR.reason_id
    FROM activity A
    LEFT JOIN activity_reason AR ON AR.activity_id = A.activity_id
    ORDER BY activity_id, reason_id
  ) AS ordered_reasons
  GROUP BY activity_id
) reason_arrays
GROUP BY reasons

First you aggregate all the reasons for an activity into an array for each activity. You have to order the associations first, otherwise ['a','b'] and ['b','a'] will be considered different sets and will have individual counts. You also need to include the join or any activity that doesn't have any reasons won't show up in the result set. I'm not sure if that is desirable or not, I can take it back out if you want activities that don't have a reason to not be included. Then you count the number of activities that have the same sets of reasons.

Here is a sqlfiddle to demonstrate

As mentioned by Gordon Linoff you could also use a string instead of an array. I'm not sure which would be better for performance.

rdubya
  • 2,916
  • 1
  • 16
  • 20
  • you don't need to use the left join. this should do the same: `SELECT count(*) as count, reasons FROM ( SELECT activity_id, array_agg(reason_id) AS reasons FROM activity_reason GROUP BY activity_id ) reason_arrays GROUP BY reasons` – Zhiliang Xing Apr 06 '16 at 02:03
  • @ZhiliangTakutoXing as I mentioned in the answer, the left join is needed if you want to also count activities that don't have reasons associated with them. If those shouldn't be counted, then the left join isn't needed. Also, as I stated in my answer, you need to order the reasons or the arrays will have different orders and won't be grouped correctly. – rdubya Apr 06 '16 at 09:23
1

We need to compare sorted lists of reasons to identify equal sets.

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id) AS reason_list
   FROM  (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

ORDER BY reason_id in the innermost subquery would work, too, but adding activity_id is typically faster.

And we don't strictly need the innermost subquery at all. This works as well:

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
   FROM   activity_reason
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

But it's typically slower for processing all or most of the table. Quoting the manual:

Alternatively, supplying the input values from a sorted subquery will usually work.

We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with data type "char", btw). It can fail for longer strings, though. The aggregated value can be ambiguous.


If reason_id would be an integer (like it typically is), there is another, faster solution with sort() from the additional module intarray:

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT sort(array_agg(reason_id)) AS reason_list
   FROM   activity_reason2
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

Related, with more explanation:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

You can do this using string_agg():

select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
      from activity_reason
      group by activity_id
     ) a
group by reasons
order by count(*) desc;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786