Elegant normalization without adding fields, extra table. Best relationship

Question

I have 2 tables I am trying to normalize. The problem is I don't want to create an offhand table with new fields, though a link table perhaps works. What is the most elegant way to convey that the "Nintendo" entry is BOTH a publisher and a developer? I don't want "Nintendo" to be duplicated. I am thinking a many-to-many relationship can be key here.

I want to stress that I absolutely want the developer and a publisher tables to remain. I don't mind creating a link between the 2 with a new relationship.

Here are the 2 tables I am trying to normalize:

enter image description here

Below is a solution I tried (I don't like it): enter image description here

I'm not sure the concept of "elegance" has a place in database normalization. The relational model is based on mathematics, not fine art; there is something called first order predicate logic involved, which few people are really interested in, but the short version is: You have a many-to-many relationship, whether you like it or not. The only way to express this using normalized tables is with a linking table, full stop. — Air, Jan 03 '15 at 00:42
@AirThomas & klandshome: I agree about the RM & precision but there's no normalization here and no many-to-many relationship except one derivable from the two tables, which express the relationships "developer [id] is company named [name]" and "publisher [id] is company named [name]". — philipxy, Jan 07 '15 at 00:20
This has nothing to do with normalization and it isn't necessary to change your tables. See my updated answer. — philipxy, Jan 07 '15 at 00:21
Contrary to your comment the question does *not* "have" a M:M relationship except as expressible via the two given ones. It and every other relationship expressible in terms of given ones (of which there is an unlimited number) is redundant. That's what views and queries give. (Tables represent relationships. As in "E-R".) And if a non-derivable M:M relationship were identified it would not require changing the two tables. (PS: Relational schemas are logical. Relational schemas as "physical" is a misconception by presenters, methods and tools that do not understand the relational model.) — philipxy, Jan 07 '15 at 01:17

philipxy · Answer 1 · 2021-09-04T00:51:06.010

There is nothing wrong with your two tables.

In fact all you need is

developer(name) -- company [name] is a developer
publisher(name) -- company [name] is a publisher

Your changes have nothing to do with normalization. Normalization never creates new column names. 'I don't want "Nintendo" to be duplicated' is misconceived. There is nothing wrong per se with values appearing in multiple places. See the answers by sqlvogel & myself here.

BUT: Depending on what it means for a row to be in one of your tables there might be a better design to reduce errors because the two tables' values could be "constrained" ie depend on each other. That has something to do with "redundancy" but it is about constraints and does not involve normalization. And for us to address it you have to tell us exactly when a row goes into each table based on the world situation.

If you don't want to repeat the strings for implementation(-dependent) reasons (space taken or speed of operations at the expense of more joins) then add a table of name ids and strings (actually company ids and names) and replace your old name columns and values by company id columns and values. But that's not normalization, that's complicating your schema for the sake of implementation-dependent data optimization tradeoffs. (And you should demonstrate this is needed and works.)

The currently accepted answer (tables Game_Company, Company_Role & Game_Company_Role) just adds a lot of redundant data. Just like your question adds three redundant tables. The original two tables already say what companies are developers and which are publishers. The other tables are just views/queries on the two!

If you want a new table for "[id] identifies a company named [name] with ..." then this is a case of developers and publisher as subtypes of supertype company. Search on database subtypes. See this answer. Then you would use company id instead of name to identify companies. You could also then further simplify (!) by using company id as the only column in tables developer and publisher and also everywhere else instead of developer_id and publisher_id.

"Redundancy" is not about values appearing in multiple places. It is about multiple rows stating the same thing about the application. When using a design like that there are two basic problems: to say certain things multiple rows are involved (while the normalized version involves just one row); and there is no way to say just one of the things at a time (which normalization can help with). If you make two different independent statements about Nintendo then you need two tables and Nintendo mentioned in each one. Re rows making statements about the application see this. (And search my other answers re a table's "statement" or criterion".) Normalization helps because it replaces tables whose rows state things of the form "... AND ..." by other tables that state the "..." separately. See this and this. (Normalization is commonly erroneously thought to involve or include avoiding multiple similar columns, avoiding columns whose values have repetitive structure and/or replacing strings by ids, but although these can be good design ideas they're not normalization.)

In comments, chat and another answer you gave this starting point:

Here's the simplest design. (I'll assume game titles are not unique so you need game_ids.)

-- game [game_id] with title [title] released on [release_date] is rated [rating]
game(game_id,title,release_date,rating)
game_developer(game_id,name) -- game [game_id] is developed by company [name]
game_publisher(game_id,name) -- game [game_id] is published by company [name]
game_platform(game_id,name) -- game [game_id] is on platform [name]

Only if you want a separate list of companies so that a company can exist without developing or publishing and/or can have its own data do you need to add:

company(name,...) -- [name] identifies a company

Only if you want role-specific data for developers and publishers do you need to add:

developer(name,...) -- developer [name] has ...
publisher(name,...) -- publisher [name] has ...

The relevant foreign keys of the various options are straightward.

None of your versions need _ids. Your versions 2 & 3 won't work because they don't say what companies develop a game or what companies publish a game. You don't need roles but if you have them (Verison 2) then you need a table "game [game_id] has company [name] as [role]". Otherwise (Verision 3) you need tables for "[game_id] is developed by company [name]" and "game [game_id] is published by company [name]". Wherever you differ from my designs ask yourself why you have additional structure and why you can do without it and (possibly) why you would explicitly want it anyway.

There is no “new column name”, the name was changed to avoid using a SQL reserved word `TYPE`. There certainly is a problem with “values appearing in multiple places” if they represent the same value. This is basic First Normal Form and has nothing to do with “space taken”; @klansdhome told us the two “Nintendo”s are the same company, putting that in two tables opens the possibility of an update anomaly if you change one but not the other. — Dour High Arch, Jan 07 '15 at 01:33
I didn't say your design did; the user's "solution" introduces "type". I didn't say duplicates aren't ever redundant, I said they were not necessarily redundant. (As with ids in *your own table*.) I didn't say normalization saved space; I said replacing strings by ids and a lookup table can & I said that is *not* normalization. "This is basic 1NF" is wrong; learn about normalization which is about replacing tables by others redistributing columns so the new tables join to the old. There are no update anomolies in the two tables; learn what one is. Also please read my answer more carefully. — philipxy, Jan 07 '15 at 02:34
"'I don't want "Nintendo" to be duplicated' is misconceived. There is nothing wrong per se with values appearing in multiple places." This is an eye-opener. — Snerd, Jan 10 '15 at 00:23
So Nintendo appearing twice is not really "redundant" data. This makes everything much simpler if so and I can just look up a dev or a publisher table nicely since those are the only 2 major roles a company has. Quite simply, a game has a DEVELOPER and PUBLISHER. However, A game can have 1 or many publishers and 1 or many developers. Also repeats. Still trying to get my head around repeats, but it kinda makes sense.. — Snerd, Jan 10 '15 at 00:26
@DourHighArch"Two Nintendos are the same company, putting that in two tables opens the possibility of an update anomaly if you change one but not the other." Is it really important that "two Nintendo's" are the same. Perhaps this is over-thinking and it is ok for it to appear twice BUT in 2 separate tables. Also Nintendo is an example. Many other companies are both developers and publishers. Example: "Konami" is both a publisher and developer in the same way as Nintendo. — Snerd, Jan 10 '15 at 00:49
@klandshome, yes it is critical whether the “two Nintendos” are the same because, if they are, putting it in twice isn't just redundant, it violates First Normal Form and leaves your database vulnerable to update anomalies. Please study database normalization; it is not about “replacing tables by others” it is to prevent your data from becoming inconsistent. — Dour High Arch, Jan 10 '15 at 02:15
Let's think about this conceptually. So what if they are the same? But are they really the same? I mean having a simple developer and a simple publisher table makes it so much easier and less tables. I mean the roles table is only going to have 2 fields: pub and dev. We can think of the 2 Nintendos as "different". One is Nintendo the Pub and the other is Nintendo the Dev. I do a select on all devs from a dev table and all pubs from a pub table. It seams so much simpler and makes sense. And it really is not TRUE redundant data because of the different roles but same name. @philipxy is correct — Snerd, Jan 10 '15 at 04:20
@philipxy Look at my new answer below showing the entire 4 tables and the 3 link-tables to make the many-to-many relationship work. I took your word and read up on the "redundant" data. Also read my new comments above. Thanks for the great insight and this really made me realize a new perspective in design and what redundant conceptually means. I hope I am correct. — Snerd, Jan 10 '15 at 07:27
@philipxy Forgot to add. How about a game that is independently developed and published by one guy as many small apps and indies are. The guy's name is John Doe. He is both the developer and the publisher. So is it fine if John Doe goes into the Developer table AND into the Publisher table? He is the same person. See this is the problem or it appears to be. I stress this a final time and I am still conflicted. A lot of games have a developer and publisher that is the same company. I just hate the idea of having to break it down into more tables as I had above with a roles table, etc. — Snerd, Jan 10 '15 at 07:52
@klandshome A value in a row just contributes to what the DBA tells you a row states by being in a table. 'Nintendo' is a *value* and what a row says using it (about a company or a name or a company with that name) is up to you as DBA and you must tell us. Apparently here the two 'Nintendo's mean *the same thing*, and *that's fine*. Saying Joe is a dev and Joe is a pub does not mean or say that there are two Joes. It just makes TWO STATEMENTS USING 'Joe' and apparently here THEY BOTH USE 'Joe' TO REFER TO THE ONE AND ONLY JOE. (Everything DourHighArch comments here is wrong.) — philipxy, Jan 16 '15 at 17:18

Dour High Arch · Accepted Answer · 2015-01-03T04:06:58.980

1

I think you want something like this:

Game_Company
ID    Name
 1    Retro Studios
 2    HAL Laboratories
 3    Nintendo
 ...

Company_Role
ID    Name
 1    Developer
 2    Publisher
 ...

Game_Company_Role
CompanyID    RoleID
        1         1
        2         1
        3         1
        3         2
 ...

To get a list of all companies that have role 'Developer':

SELECT gc.name
FROM Game_Company gc JOIN Game_Company_Role gcr ON gcr.CompanyID=gc.ID
WHERE gcr.RoleID = 1

edited Jan 03 '15 at 04:06

answered Jan 03 '15 at 00:36

Dour High Arch

21,513
29
75
90

Hmm. So my solution was correct and the only way? I have to create a "role" table? No way to make it work with a Developer and Publisher table? – Snerd Jan 03 '15 at 01:17
Also can field be "Type" under Company_Role instead of "Name"? – Snerd Jan 03 '15 at 01:28
1

Having separate Developer and Publisher tables make their rows different data; developer Nintendo would be a totally different Nintendo than the one in Publisher. Do you want that? – Dour High Arch Jan 03 '15 at 01:30
1

Avoid naming anything in your schema `Type`; it is a SQL reserved word. – Dour High Arch Jan 03 '15 at 01:31
Noted. That is what I though about Nintendo being "doubled." Don't want that. Thanks for the tips. Left the field "name". – Snerd Jan 03 '15 at 01:37
But now. Final task. If I wanted the simplest way to call up only the developers and not the publishers what query would work? – Snerd Jan 03 '15 at 01:38
1

@klandshome & DourHighArch This added table is entirely redundant, it's expressible as a query on the two tables. – philipxy Jan 07 '15 at 00:16

NoChance · Answer 3 · 2015-01-03T07:41:44.247

0

This is a bit generic approach to the problem, it may be of interest. As @Dour High Arch has pointed out in his solution, the Developer and Publisher are just roles for a 'party'. Each part has 0,1 or more roles with a given product and roles may overlap.This is good and bad. For example, a product may be developed by 5 developers but published by at most 1 publisher. I have chosen to introduce a serial_id as system generated PK, but this is not mandatory. You could use the 3FKs as a PK and not user the serial_id.

Notice that having a party as a generalization of different entity types is not always good since 1 or more columns will have to be set to not mandatory if it is not common to all parties, however, this is very common in real applications.

Convention:

name_PK = Primary Key,

name_FK = Foreign Key

edited Jan 03 '15 at 07:41

answered Jan 03 '15 at 00:57

NoChance

5,632
4
31
45

Which program did you use for this UML? – Snerd Jan 03 '15 at 01:41
1

There are many good tools for ERD and UML, I used Gliffy at: https://www.gliffy.com – NoChance Jan 03 '15 at 07:36
1

There is no benefit in all this and it needlessly adds complexity. The original tables already express which companies are developers and which are publishers! – philipxy Jan 07 '15 at 00:27
1

@philipxy, I disagree with you comment. Suppose you want to search for the name "Konami" and you don't know whether this name is a developer or a company, you will have to search 2 tables not one. The CRUD with 2 tables may also have to address two tables which leads to more code. In addition, I mentioned that this a generalization and not necessarily the physical implementation. – NoChance Jan 07 '15 at 11:53
1

(SELECT * FROM game_publisher UNION SELECT * FROM game_publisher) WHERE name = 'Konami'. (How could you think it not possible when your schema does not contain any more info than the two tables do? There is no CRUD update to the two tables that is more complex than in your design since there are no constraints between them and they are in 5NF. I get that you are generalizing; but seriously: when the user's original tables are already adequate? Granted, in comments the OP says they want a company(id,name,...) table; but then the simplest solution is just it & *projections* of the two originals. – philipxy Jan 08 '15 at 01:22
1

Re the UNION: Yes, two tables. But no constraints. Vs a company table and constraints. (But I find the tradeoff to be beside the point since the original tables are adequate.) – philipxy Jan 08 '15 at 01:30
1

@philipxy, thanks for taking the time to explain your point. Generalization usually leads to complex models. – NoChance Jan 08 '15 at 01:37

score 0 · Answer 4 · edited Sep 06 '15 at 01:48

0

Here are three final solutions as proposed by the comments. You can see the table being broken down from the top "un-normalized" table.

The rules are as follows:

1 game can have 1 or many developers and 1 developer can have 1 or many games.
1 game can have 1 or many publishers and 1 publisher can have 1 or many games.
1 game can have 1 or many platforms and 1 platform can have 1 or many games.

Version 1

I left the 2 "Nintendo" entries in red. According to research and implementation, this is not technically redundant data. See my comments under philipxy's answer. This looks simple and elegant. 4 tables with a many-to-many relationship.

enter image description here

Here is the relationship diagram (4 tables and 3 link tables):

enter image description here

Verison 2

Version 1 "repeats" "Nintendo" but Version 2 has a "Company" table instead. Compare the 2 different versions. What is the right way?

enter image description here

Version 3

Here is the subtyping philipxy was talking about. How is this version?

enter image description here

edited Sep 06 '15 at 01:48

philipxy

14,867
6
39
83

answered Jan 10 '15 at 07:23

Snerd

1,463
3
25
36

You are correct. But what do you think of version 2? – Snerd Jan 16 '15 at 08:19
Can I see a visualization of what you mean exactly about not using a role attribute? A little confused. – Snerd Jan 16 '15 at 16:01
Trying to visualize the company table now. So we keep the company table? Can you draw this out? Or do we keep dev and pub.. something is not clear – Snerd Jan 16 '15 at 16:12
Can you draw this out? In you solution then there is a GAME, COMPANY, DEVELOPER, PUBLISHER, and PLATFORM table? – Snerd Jan 16 '15 at 16:28
1

If you want a company table added to 1 then you don't need to use tables role and company_role_company and columns role_id and name. Just replace name by company_id in developer and publisher. See my answer re that and more. I think that is clear. Anyway, you should really take the question part of this answer and post it as a question. – philipxy Jan 16 '15 at 16:32
Why keep 3 tables: company, dev and pub? Too many. The point was to make dev and pub into 1 company table.. – Snerd Jan 16 '15 at 16:45
1

You need either dev & pub or (more complicated) company & role info. Independent of whether we have company for that reason we typically need it anyway for non dev/pub info about a company. As explained in my answer it's a supertype. Ditto for people even if they can be dev/pub. – philipxy Jan 16 '15 at 17:29
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/68998/discussion-between-klandshome-and-philipxy). – Snerd Jan 16 '15 at 23:32
1

I just happened by this answer & updated my answer to address your "top un-normalized" table and other solutions. (I don't recall seeing this second request to chat back then.) – philipxy Sep 06 '15 at 03:18

Elegant normalization without adding fields, extra table. Best relationship

4 Answers4

Linked