PostgreSQL - How to JOIN M:M table the right way?

Question

My database structure looks like this:

CREATE TABLE categories (
    name VARCHAR(30) PRIMARY KEY
);

CREATE TABLE additives (
    name VARCHAR(30) PRIMARY KEY
);

CREATE TABLE beverages (
    name VARCHAR(30) PRIMARY KEY,
    description VARCHAR(200),
    price NUMERIC(5, 2) NOT NULL CHECK (price >= 0),
    category VARCHAR(30) NOT NULL REFERENCES categories(name) ON DELETE CASCADE ON UPDATE CASCADE
);

CREATE TABLE b_additives_xref (
    bname VARCHAR(30) REFERENCES beverages(name) ON DELETE CASCADE ON UPDATE CASCADE,
    aname VARCHAR(30) REFERENCES additives(name) ON DELETE CASCADE ON UPDATE CASCADE, 
    PRIMARY KEY(bname, aname)
);


INSERT INTO categories VALUES
    ('Cocktails'), ('Biere'), ('Alkoholfreies');

INSERT INTO additives VALUES 
    ('Kaliumphosphat (E 340)'), ('Pektin (E 440)'), ('Citronensäure (E 330)');

INSERT INTO beverages VALUES
    ('Mojito Speciale', 'Cocktail mit Rum, Rohrzucker und Minze', 8, 'Cocktails'),
    ('Franziskaner Weißbier', 'Köstlich mildes Hefeweizen', 6, 'Biere'),
    ('Augustiner Hell', 'Frisch gekühlt vom Fass', 5, 'Biere'),
    ('Coca Cola', 'Coffeeinhaltiges Erfrischungsgetränk', 2.75, 'Alkoholfreies'),
    ('Sprite', 'Erfrischende Zitronenlimonade', 2.50, 'Alkoholfreies'),
    ('Karaffe Wasser', 'Kaltes, gashaltiges Wasser', 6.50, 'Alkoholfreies');

INSERT INTO b_additives_xref VALUES
    ('Coca Cola', 'Kaliumphosphat (E 340)'),
    ('Coca Cola', 'Pektin (E 440)'),
    ('Coca Cola', 'Citronensäure (E 330)');

SqlFiddle

What I am trying to achieve is to list all beverages and their attributes (price, description etc.) and add another column additives from the b_additives_xref table, that holds a concatenated string with all additives contained in each beverage.

My query currently looks like this and is almost working (I guess):

SELECT 
    beverages.name AS name, 
    beverages.description AS description, 
    beverages.price AS price,
    beverages.category AS category, 
    string_agg(additives.name, ', ') AS additives 
FROM beverages, additives
    LEFT JOIN b_additives_xref ON b_additives_xref.aname = additives.name 
GROUP BY beverages.name
ORDER BY beverages.category;

The output looks like:

Coca Cola       | Coffeeinhaltiges Erfrischungsgetränk | 2.75 | Alkoholfreies | Kaliumphosphat (E 340), Pektin (E 440), Citronensäure (E 330)
Karaffe Wasser  | Kaltes, gashaltiges Wasser           | 6.50 | Alkoholfreies | Kaliumphosphat (E 340), Pektin (E 440), Citronensäure (E 330)
Sprite          | Erfrischende Zitronenlimonade        | 2.50 | Alkoholfreies | Kaliumphosphat (E 340), Pektin (E 440), Citronensäure (E 330)
Augustiner Hell | Frisch gekühlt vom Fass              | 5.00 | Biere         | Kaliumphosphat (E 340)[...]

Which, of course, is wrong since only 'Coca Cola' has existing rows in the b_additives_xref table.
Except for the row 'Coca Cola' all other rows should have 'null' or 'empty field' values in the column 'additives'. What am I doing wrong?

Hint#1: use numeric IDs instead of varchar columns, since they are used as FKs. — wildplasser, May 12 '14 at 22:35
I had numeric IDs first for the primary keys, but someone told me there's not really a difference in speed between VARCHAR and NUMERIC FKs. Is this statement false? If not, where is the advantage of using numeric identifier columns? — phew, May 12 '14 at 23:09
0) separation of key<-->value 1) space 2) fixed size. 3) time. ad 12: consider (multiple) FKs pointing to the same additive record ad0: consider renaming or fixing a typo in the case of (multiple) FKs pointing to an entry. — wildplasser, May 12 '14 at 23:17
Allright i'll switch it back to NUMERICs :-) Thanks for the information. — phew, May 12 '14 at 23:18
+1 A question well presented. The only thing missing is your version of Postgres. — Erwin Brandstetter, May 13 '14 at 02:27
I can't even escape stackoverflow when I'm googling my beer! I landed here from a google search on `Franziskaner Weissbier` lol. It's overflowing every time I put it down (which is oddly fitting here). — Millie Smith, May 05 '15 at 05:40

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

Some advice on your

Schema

CREATE TABLE category (
   category_id int PRIMARY KEY
  ,category    text UNIQUE NOT NULL
);

CREATE TABLE beverage (
   beverage_id serial PRIMARY KEY
  ,beverage    text UNIQUE NOT NULL  -- maybe not unique?
  ,description text
  ,price       int NOT NULL CHECK (price >= 0)  -- in Cent
  ,category_id int NOT NULL REFERENCES category ON UPDATE CASCADE
                                        -- not: ON DELETE CASCADE 
);

CREATE TABLE additive (
   additive_id serial PRIMARY KEY
  ,additive    text UNIQUE
);

CREATE TABLE bev_add (
    beverage_id int REFERENCES beverage ON DELETE CASCADE ON UPDATE CASCADE
   ,additive_id int REFERENCES additive ON DELETE CASCADE ON UPDATE CASCADE 
   ,PRIMARY KEY(beverage_id, additive_id)
);

Never use "name" as name. It's a terrible, non-descriptive name.
Use small surrogate primary keys, preferably serial columns for big tables or simple integer for small tables. Chances are, the names of beverages and additives are not strictly unique and you want to change them from time to time, which makes them bad candidates for the primary key. integer columns are also smaller and faster to process.
If you only have a handful of categories with no additional attributes, consider an enum instead.
It's good practice to use the same (descriptive) name for foreign key and primary key when they hold the same values.
I never use the plural form as table name unless a single row holds multiple instances of something. Shorter, just a meaningful, leaves the plural for actual plural rows.
Just use text instead of character varying (n).
Think twice before you define a fk constraint to a look-up table with ON DELETE CASCADE
Typically you do not want to delete all beverages automatically if you delete a category (by mistake).
Consider a plain integer column instead of NUMERIC(5, 2) (with the number of Cent instead of € / $). Smaller, faster, simpler. Format on output when needed.

More advice and links in this closely related answer:
How to implement a many-to-many relationship in PostgreSQL?

Query

Adapted to new schema and some general advice.

SELECT b.*, string_agg(a.additive, ', ' ORDER BY a.additive) AS additives
                                     -- order by optional for sorted list
FROM   beverage      b
JOIN   category      c USING (category_id)
LEFT   JOIN bev_add ba USING (beverage_id)  -- simpler now
LEFT   JOIN additive a USING (additive_id)
GROUP  BY b.beverage_id, c.category_id
ORDER  BY c.category;

You don't need a column alias if the column name is the same as the alias.
With the suggested naming convention you can conveniently use USING in joins.
You need to join to category and GROUP BY category_id or category in addition (drawback of suggested schema).
The query will still be faster for big tables, because tables are smaller and indexes are smaller and faster and fewer pages have to be read.

Thank you for the kind advice. Like wildplasser has suggested in a comment above I have added SERIAL identifier columns. I was just about to post on [Code Review](http://codereview.stackexchange.com/) when I noticed your post. I am going to try restructuring the layout with using the conventions you suggested and let you know the [Code Review] link when I'm done. — phew, May 13 '14 at 12:16
As I'm going through your suggestions point-by-point I noticed that the [PostgreSQL DOCS](http://www.postgresql.org/docs/8.1/static/datatype.html#DATATYPE-SERIAL) describe `SERIAL` being equivalent to `col integer DEFAULT nextval('tablename_colname_seq') NOT NULL`. You suggest to use `x_id int PRIMARY KEY` for small tables over `x_id SERIAL PRIMARY KEY`. Is there really a difference? Ofcourse in your example version of my layout the `category` table has `category_id int PRIMARY KEY`, therefore no auto increment. I may want to go with `SERIAL` though because `category` entries may be added and — phew, May 13 '14 at 13:07
@phew: int / serial: No difference in the data, just in the handling. For a small lookup table, where you do not enter values regularly, you may not need the functionality of a `serial` is all. — Erwin Brandstetter, May 13 '14 at 13:24
Allright. Hence category entries may change a lot I am going to stick to `SERIAL`. One more question though; what is the advantage of using `TEMP`? The [PostgreSQL DOCS](http://www.postgresql.org/docs/9.2/static/sql-createtable.html) only state that `TEMP` tables exist in a special schema. Google suggests that `TEMP` significantly increases performance. Trying to execute the `CREATE`s you provided I get an error stating that constraints for `TEMP` tables can only reference `TEMP` tables (I assume that's because table _category_ is not `TEMP` while table _beverage_ is). When to use `TEMP`? — phew, May 13 '14 at 13:34
@phew: Oh, sorry, `TEMP` was an artifact from my local test. It's for temporary tables (which I used since I didn't need to persist). Removed it. Temp tables are faster, but vanish at the end of the session (or sooner if so defined) and are only visible in the current session. — Erwin Brandstetter, May 13 '14 at 14:55

Brad · Accepted Answer · 2014-05-13T00:06:42.540

1

I believe you are looking for this

SELECT 
    B.name AS name, 
    B.description AS description, 
    B.price AS price,
    B.category AS category, 
    string_agg(A.name, ', ') AS additives 
FROM Beverages B
    LEFT JOIN b_additives_xref xref ON xref.bname = B.name 
    Left join additives A on A.name = xref.aname
GROUP BY B.name
ORDER BY B.category;

Output

NAME    DESCRIPTION                                 PRICE   CATEGORY        ADDITIVES
Coca Cola   Coffeeinhaltiges Erfrischungsgetränk    2.75    Alkoholfreies   Kaliumphosphat (E 340), Pektin (E 440), Citronensäure (E 330)

The problem was that you had a Cartesian product between your beverages and additives tables

FROM beverages, additives

Every record got places with every other record. They both need to be explicitly joined to th xref table.

edited May 13 '14 at 00:06

answered May 12 '14 at 22:37

Brad

11,934
4
45
73

It is not exactly what I want but close to it (I assume). I want the Coca Cola row displayed like in your result, along with all the other beverages, one per row, just for those who do not have a `b_additives_xref` table entry the column 'additives' field should be NULL. – phew May 12 '14 at 22:52
1

Your right. I need to switch beverages and additives. I'll edit when i can – Brad May 12 '14 at 23:35
Thank you good sir, I figured it out myself using your query. Just needed a little break to review what you did there, my head was smoking after dealing with it for quite some time earlier. – phew May 12 '14 at 23:47

score 1 · Answer 3 · answered May 12 '14 at 23:49

1

The query I was looking for looks like:

SELECT 
    B.name AS name, 
    B.description AS description, 
    B.price AS price,
    B.category AS category, 
    string_agg(A.name, ', ') AS additives 
FROM beverages B
    LEFT JOIN b_additives_xref xref ON xref.bname = B.name 
    LEFT JOIN additives A on A.name = xref.aname
GROUP BY B.name
ORDER BY B.category;

Credits go to Brad for kind of giving me the solution in his answer & comments.

answered May 12 '14 at 23:49

phew

808
1
15
34

Well, I would have done that but by the time, before he edited the post, the query he posted was not 100% correct. So I figured it out myself and posted the answer, after I did that he corrected his answer and added the working solution. – phew May 13 '14 at 12:30

PostgreSQL - How to JOIN M:M table the right way?

3 Answers3

Schema

Query