Store small array as multiple columns in SQL

Question

I am in an argument with a friend of mine on how to store a small array (<10) of references in SQL. Lets say there is a class player which can hold one item in it's inventory. Describing this as an SQL table would be fairly straight forward:

CREATE TABLE items(
    id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    type      VARCHAR(32),
    weight    FLOAT
);
CREATE TABLE players(
    id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(128) not null,
    item      INT,
    FOREIGN KEY(item) REFERENCES items(id)
);

The question is now: If the player could hold more than one item, but a small fixed amount of them, would it be better to store them in an additional table and then JOIN over them, like this:

Additional Table

CREATE TABLE players(
    id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(128) not null
);
CREATE TABLE inventory(
    id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    item      INT NOT NULL,
    player    INT NOT NULL,
    FOREIGN KEY(item) REFERENCES items(id),
    FOREIGN KEY(player) REFERENCES players(id)
);

Or would it be better to just add additional columns? If the amount of items was dynamic and unlimited, this would of course not be an option:

Multiple Columns

CREATE TABLE players(
    id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(128) not null
    item1     INT,
    item2     INT,
    item3     INT,
    item4     INT,
    FOREIGN KEY(item1) REFERENCES items(id)
    FOREIGN KEY(item2) REFERENCES items(id)
    FOREIGN KEY(item3) REFERENCES items(id)
    FOREIGN KEY(item4) REFERENCES items(id)
);

This would have the advantage of not having to join over a table that would grow very big very fast but might fragment your table very fast if not any player is carrying all four items all the time.

Which way should be preferred?
Is the second way using multiple columns violating the first normal form?
What are the advantages and disadvantages to both ways?
What will deliver the best performance?

Multiple tables not multiple columns. This is obviously for some game program of sort you are writing. The funny thing about games is getting the balance right. So you might say the player can carry 20 items but find that it too little. Now you want to add 5 items so you add 5 column rework the logic and then test oops 25 is too much lets try 22 now delete 3 columns then rework logic and retest. Using multiple tables will allow you to use a trigger for example to check before insert if the limit is met. Need to change the limit? Change the trigger. — Namphibian, Mar 26 '15 at 06:43

score 3 · Accepted Answer · edited May 23 '17 at 11:43

This might make a good interview question.

Preference is opinon. It depends. However I avoid the "multiple columns" technique (technique #2) if I have more than 2 items per player for many many reasons. For one, if you design and code n=10 items per player, how many items will the project manager want tomorrow? n+1 of course.
I belive the "Multiple Columns" techinque is 1NF Becuase the data is atomic (although it requires null values)

"Many writers misunderstand the concept of a repeating group and use it to claim that a certain table is in violation of 1NF."

https://www.simple-talk.com/sql/learn-sql-server/facts-and-fallacies-about-first-normal-form/

Just because it's 1NF does mean it's a good solution. Normalization per se is not as important as application useability, maintainibilty, and performance. De-normalization is common practice for performance.

see below
What problem are you solving? You give a technique but until you give a problem to solve , you can't measure performance. If may be more preformant for writes but not reads.

Write some example SQL for the questions you application needs to answer. For your technique #2, almost all questions that I can think of require use of subselects (or case statements). These are hard to maintain, I think (hence not 'preferrable') Let's number your two techniques #1 and #2. Here are (too many) example SQL solutions for each:

How many items are in each player?

#1. Select count(inventory.item) from inventory inner join player = 1

#2. really depends on your db, for example MySQL You might use IFNULL(item1,0) and sum them, or CASE statements. Not going to attempt to write this code.

Which players have item id = 9?

select id from players from players inner join inventory on players.id = inventory.player where inventory.item = 9
select id from players where item1=9 or item2=9 or item3=9 ....

Which players have item id X and Y?

select id from players from players inner join inventory on players.id = inventory.player where inventory.item = X or inventory.item = Y;
select id from players where id in (select id from players where item1 = X or item2 = X....) or id in (select id from players where item1 = Y or item2 = Y ...) or ...

Since items have weights, which players have items with weight > 10 ?

select distinct players.* from players inner join inventory on players.id = inventory.player inner join items on inventory.item = items.id where items.weight > 10
select distinct id from players where players.item1 in (select id from items where items.weight > 10) or players.item2 in (select id from items where items.weight > 10) or ...

Notice I'm not finishing the SQL for technique #2. Would you?

There are many other examples of painful SQL. Which players have the highest total weight? Delete all items with a certain id. I'm not going to answer these; for each case in my opinion the sql for technique #2 is harder to maintain (for me == not preferrable).

There may be techniques to make these subselects simpler ( paramertized views, SQL templates in your application code) but it depends on your platform.

Optimizing using indexes becomes problematic also, because it seems to me you'll need an index on every item column in your players table.

If I'm correct that technique #2 requires sub-selects, I hear joins are more efficient (Join vs. sub-query )

Using the technique #1, ( ADDITIONAL TABLE ) just use a trigger or application code to enforce the rule limiting 10 items per player. That kind of rule is easier to change than all of the SELECTs

I should stop now, but here is something else you two can argue about. If your items don't have properties (or the properites are rarely referenced), consider technique #3:

SINGLE COLUMN DELIMITED LIST

CREATE TABLE players( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, name VARCHAR(128) not null, items VARCHAR(2048) -- or whatever size you need, or TEXT );

INSERT INTO PLAYERS (name, items) values ('player 1', 'itemX, itemY, itemZ');

Not normalized, but who cares if it's fast!

score 1 · Answer 2 · answered Mar 26 '15 at 05:21

1

I hope this isn't a homework problem.

It's difficult to decide which way to go in the context of a small problem. It's going to depend on what other entities exist in the system, and how they are used.

The first is more efficient for smaller datasets, and is easier to maintain, but less flexible than the second and as the number of entities increases the more normalized version will become more efficient.

Strongly recommend you read a few articles or find a good book that addresses database normalization.

Edit: Here this should be a decent start: http://holowczak.com/database-normalization/

answered Mar 26 '15 at 05:21

Sammy

467
3
13

Thank you. I do know about database normalization. So is the approach using multiple columns not normalized? I assure you that this is not a homework problem ;) – Prior99 Mar 26 '15 at 06:07
1

@Prior99 adding multiple columns like that is called de-normalizing the database. It is used in DSS/OLAP system to reduce joins and increase speed. – Namphibian Mar 26 '15 at 06:45

Jay · Answer 3 · 2015-04-09T12:45:00.160

Make another table.

Yes, making multiple columns violates 1NF. Why should you obey this rule? Consider:

(1) Is the limit of 10 absolute? It sounds like this is some sort of game (from the word "player") so maybe it is. But in most applications, such limits tend to be of the "I can't imagine anyone ever having more than ..." variety. I worked on a system years ago for insurance where we had to record the employee's children who were covered by the policy. The original designer decided to create multiple fields, child1, child2, ... child8. He apparently said to himself, "No one would ever have more than 8 children. That will be plenty." Then we got an employee with 9 children and the system blew up.

(2) Let's say you want to test if a player is carrying a certain specified item. With two tables, you write something like

select count(*) from player_item where player_id=@pid and item_id=@iid

If count>0 then the player has the item. With one table, you'd have to write

select count(*) from player where player_id=@pid and
  (item1=@iid or item2=@iid or item3=@iid or item4=@iid or item6=@iid or item7=@iid or item8=@iid or item9=@iid or item10=@iid)

Even for a simple "is it equal" test this is a lot of extra code. And did you notice that I skipped item5? That's an easy mistake to make when typing these repetitive tests over and over. Trust me: I did it once when there were just 3 repeats. Then the program worked correctly if the desired value was in slot 1 or slot 3, but it failed when the value was in slot 2. In most of our tests we only put one item in, so it appeared to work. We didn't catch that one until we went to production.

(3) Suppose you decide that 10 is not the right limit, and you want to change it to 12. With two tables, the only place that would be changed would be the code where you create new ones, to impose the limit of 12 instead of 10. If you did it right, that 10 is a symbolic variable somewhere and not hard-coded, so you change one assignment statement. With one table, you have to change every query that reads that table.

(4) Speaking of searching the table for a given item: with two tables, you can create an index on item_id. Will one table, you need an index on item1, another index on item2, another index on item3, etc. There are 10 indexes for the system to maintain instead of 1.

(5) Joins will be a particular nightmare. It seems likely that you might want to display a list of all the items a player has with some values from the item record, like a name. With two tables, that's

select item.name from player_item
join item on item.item_id=player_item.item_id where player_id=@pid

With one table, it's

select item1.name, item2.name, item3.name, item4.name, item5.name, item6.name, item7.name, item8.name, item9.name, item10.name 
from player 
left join item item1 on item1.item_id=player.item1 
left join item item2 on item2.item_id=player.item2
left join item item3 on item3.item_id=player.item3
...

etc for 10 joins. And if the join is more complex than a simple one value with an id match, you have to repeat all the columns and all the conditions 10 times. Whew! And if later you decide you need to change the conditions, you have to make sure you make the same change ten times.

(5) How do you handle adds and deletes? Does the order matter? Like if you use one table, and there are 4 items filled in, what happens if #3 is deleted? Can we just put a null in slot 3? Or do we need to move the value from slot 4 down to slot 3 and then null out slot 4? When we add new items can they always go on the end or do we have to put them in the middle sometimes? When we show lists of items to the users, do they have to appear in some order? With two tables, we could add a simple "order by name" or "order by turn_acquired" to the query. With one table, you'd have to build an array in memory and sort them. Okay, not a huge deal to do once, but a pain if it comes up many times in the program.

Etc.

Store small array as multiple columns in SQL

Additional Table

Multiple Columns

3 Answers3