Database design for site with many product categories

Question

Im new into SQL. Trying to learn as much as I can and so took small online shop as my training aim. Im' struggling with the database structure. What i want to achieve is:

Categories (for example TV's, Washers) with an ID and name
Fields (Each category will have fields restricted to that category (for example TV's - resolution, HDR etc., Washers - capacity, washing cycles)
Products (Each product should have general fields (name, brand etc) and additionally category fields which will be diffrent in every category.

So the main idea on the system side is to create an category, add fields to it and insert some products into that category using general + category fields.

How do I achieve that? I tried to link that all with one-to-many relation but that doesnt seem to work as i expected

You can check this https://stackoverflow.com/questions/19144200/designing-a-sql-schema-for-a-combination-of-many-to-many-relationship-variation and this https://dba.stackexchange.com/questions/123467/schema-design-for-products-with-multiple-variants-attributes — PSK, Feb 20 '19 at 11:24
Check the EAV model as an extension for category fields, for example in the [Programming with databases](https://www.amazon.com/dp/2956300806) book — serge, Feb 20 '19 at 14:39

score 5 · Accepted Answer · 2019-02-21T13:29:05.373

This is a known (anti) pattern called "Entity Attribute Value" (you can search for that name in the internet if you want to find out more).

Nowadays (and especially with Postgres) I would go for a JSONB column that stores the category specific attributes of each product rather than an additional fields table.

You can even go so far to validate the dynamic attributes in the product table based on the meta-information in the category table.

So something like this:

create table category
(
   id integer primary key, 
   name varchar(50) not null,
   allowed_attributes jsonb not null
);

create table product
(
   id integer primary key, 
   name varchar(100) not null, 
   brand varchar(100) not null, -- that should probably be a foreign key
   ... other common columns ...
);

create table product_category
(
   product_id integer not null references product,
   category_id integer not null references category, 
   attributes jsonb not null, -- category specific attributes
   primary key (product_id, category_id)
);

Now with the list of "allowed attributes" in the category table we can write a trigger that validates them.

First I create a little helper function that makes sure that all keys from one JSON value are present in another:

create function validate_attributes(p_allowed jsonb, p_to_check jsonb)
  returns boolean
as
$$
   select p_allowed ?& (select array_agg(k) from jsonb_object_keys(p_to_check) as t(k));
$$
language sql;

This function is then used in the trigger for the category table:

create function validate_category_trg()
  returns trigger
as
$$
declare
   l_allowed jsonb;
   l_valid   boolean;
begin

   select allowed_attributes 
      into l_allowed
   from category
   where id = new.category_id;

   l_valid := validate_attributes(l_allowed, new.attributes);
   if l_valid = false then 
     raise 'some attributes are not allowed for that category';
   end if;
   return new;
end;
$$
language plpgsql;

Now let's insert some sample data:

insert into category (id, name, allowed_attributes)
values
(1, 'TV Set', '{"display_size": "number", "color": "string"}'::jsonb), 
(2, 'Laptop', '{"ram_gb": "number", "display_size": "number"}');

insert into product (id, name)
values
(1, 'Big TV'),
(2, 'Small  TV'),
(3, 'High-End Laptop');

And now let's insert the category information:

insert into product_category (product_id, category_id, attributes)
values
(1, 1, '{"display_size": 60}'),  -- Big TV 
(2, 1, '{"display_size": 32}'),  -- Small TV
(3, 2, '{"ram_gb": 128}'); -- Laptop

This works as all attributes are defined in the category. If we tried to insert the following:

insert into product_category (product_id, category_id, attributes)
values
(3, 2, '{"usb_ports": 5}');

Then the trigger will throw an exception preventing use from inserting the row.

This can be extended to actually use the data type information stored in the allowed_attributes.

To find products based on attributes, we can use the JSON functions provided by Postgres, e.g. all products that have a display_size:

select p.*
from product p
where exists (select *
              from product_category pc
              where pc.product_id = p.id 
                and pc.attributes ? 'display_size');

Finding products that contain multiple attributes is just as easy (and a lot more complicated with the "traditional" EAV model).

The following query finds only products that have the attributes display_size and ram_gb

select p.*
from product p
where exists (select *
              from product_category pc
              where pc.product_id = p.id 
                and pc.attributes ?& '{display_size, ram_gb}');

This can be indexed quite efficiently to make searching faster.

I am not entirely sure you do want to store the attributes in the product_category table. Maybe they should be stored directly in the product table - but that depends on your requirements and how you want to manage them.

With the above approach you could e.g. have a category "Computer HW" that would store information like number of CPUs, RAM and clock speed. That category (and its attributes) could be used e.g. Smartphones and Laptops at the same time.

However you would need more than one row in product_category to fully describe a product if you do that.

The most common approach is probably to store the attributes directly on the product and skip all the dynamic JSONB validation.

So something like this:

create table category
(
   id integer primary key, 
   name varchar(50) not null
);

create table product
(
   id integer primary key, 
   name varchar(100) not null, 
   brand varchar(100) not null, -- that should probably be a foreign key
   attributes jsonb not null, 
   ... other common columns ...
);

create table product_category
(
   product_id integer not null references product,
   category_id integer not null references category, 
   primary key (product_id, category_id)
);

Or even a combination of both if you need category specific dynamic attributes and product specific attributes regardless of the category.

Why would you use JSON attributes instead of a Fields table? It looks like you have to maintain integrity using procedural code with this approach, and I don't see any advantages that outweigh the risks of doing that. — Jon Heggland, Feb 21 '19 at 13:20
@JonHeggland: you can't maintain integrity with an EAV model either. The JSONB values are more efficient to retrieve than an EAV model - the trigger is actual just an example on what you could do. Typically this wouldn't be necessary - the whole point of EAV or any "dynamic attributes" concept is that you can't know upfront what attributes you will have. — , Feb 21 '19 at 13:21
@JonHeggland: I personally would probably not attach the attributes to the category, but directly to the product itself. — , Feb 21 '19 at 13:29
The integrity the OP asks for, fields restricted by category, can be done with foreign keys in an all-relational model. No need for triggers and user-defined functions. — Jon Heggland, Feb 21 '19 at 13:40
@JonHeggland: not with an EAV model as you can't create a foreign key that works on a subset of rows. So you can't control which attribute is assigned to which product. See this simplified EAV model: https://rextester.com/IDUT69711 — , Feb 21 '19 at 13:48
You would obviously need a CategoryField table containing allowable fields per category, as in Omari Celestine's answer. Or would you not call that an EAV model? — Jon Heggland, Feb 21 '19 at 13:54
@JonHeggland: that just moves the problem area. How do you guarantee that only categories are assigned to a product that have fields that are relevant for that product? You can't do that with a foreign key — , Feb 21 '19 at 14:03

Omari Celestine · Answer 2 · 2019-02-20T12:01:41.357

1

You can create junction tables and foreign keys to represent the relationship between the tables.

Categories Table

id | name

Fields Table

id | name

Category Fields Table

id | category_id | field_id

Brands id | name

Products Table

id | category_id | brand_id | name

Product Features

id | product_id | field_id | value

For the Products tabls you might want to consider having a seperate table for the brands and have the brand_id column in the products table instead of the name so as to avoid repetitions.

The category_fields table will store the id of the category and the id of the related field with each row in the table representing a different field for that category.

Then the table product_features will store the characteristics which would depend on the fields assigned to the category of the product.

edited Feb 20 '19 at 12:01

answered Feb 20 '19 at 11:32

Omari Celestine

1,405
1
11
21

where do i store value of the field in product? example - create new product Iphone X and place 64GB under storage field of that product – Jan Kowalski Feb 20 '19 at 11:44
@JanKowalski I updated my answer which includes a table for the entry of such data. – Omari Celestine Feb 20 '19 at 12:02

score 0 · Answer 3 · answered Feb 21 '19 at 14:43

Using Dataphor-based pseudocode, inline references (foreign keys), data types and irrelevant details omitted:

create table Category {
  CategoryId,
  CategoryName,
  key { CategoryId },
  key { CategoryName } /* Don't want categories that differ only by surrogate id */
};

/* Allowed fields */
create table CategoryField {
  CategoryId,
  FieldName,
  key { CategoryId, FieldName },
  reference CategoryField_Category
    { CategoryId } references Category { CategoryId }
};

create table Product {
  ProductId,
  ProductName,
  ProductBrand,
  CategoryId,
  key { ProductId }, /* Probably other attributes, keys and references as well */
  reference Product_Category
    { CategoryId } references Category { CategoryId }
};

create table ProductFieldValue {
  ProductId,
  CategoryId, /* Violates BCNF, but is controlled by foreign superkey */
  FieldName,
  FieldValue,
  key { ProductId, FieldName },
  reference PFV_Product
    { ProductId, CategoryId } references Product { ProductId, CategoryId },
  reference PFV_CategoryField
    { CategoryId, FieldName } references CategoryField { CategoryId, FieldName }
};

The overlapping foreign keys (I prefer the term "references", especially since one of them is actually a proper foreign superkey) ensure that each product can only have field values according to the rows in the CategoryField table.

There is some redundancy in this model—ProductFieldValue violates Boyce-Codd normal form (2NF as well, but never mind)—so you'll have to decide for yourself if the benefits of simple integrity control outweigh that drawback. But note that the redundancy is controlled; there won't be any inconsistency.

This model assumes that all field values will have the same data type, e.g. a character string. If you want constraints on that as well (e.g. some fields can only have numeric values; some are enumerated etc.), things will become a bit more complicated.

Interesting idea, I like it. However with that model a product can only be assigned to one category which is - in my experience - quite unrealistic in the real world (but the OP didn't explicitly state if that is required, I simply assumed that). One advantage of the JSONB implementation though is that you can query for multiple field values a lot easier ("All products that have a display size of 14 and 16GB RAM) — , Feb 22 '19 at 07:14
You can easily support multiple categories per product by using an association table `ProductCategory { ProductId, CategoryId }` and having `ProductFieldValue` refer to that instead of `Product`. Querying isn't that hard; in D4 I'd write something like `Product having (ProductFieldValue where FieldName = 'Display size' and FieldValue = '14') having (ProductFieldValue where FieldName = 'RAM' and FieldValue = '16GB')`, i.e. two semijoins. Perhaps JSON would be terser or clearer? You'd be mixing two rather different data models / query languages, though. That increases complexity in itself. — Jon Heggland, Feb 22 '19 at 08:34
I don't know this D4 database, but using a hybrid model in Postgres (which this question is about) surely makes things a lot easier, see e.g. here: https://rextester.com/AVBB43832 - all those queries can be supported with a single index on the attributes column and are extremely efficient. You also don't need to add one self join for each attribute that you want to check - you just add the attributes — , Feb 22 '19 at 08:48
That's pretty cool. I wonder if the JSON index works by converting it to relational EAV. :-) (D4 is the language of Dataphor; it's much better designed than SQL, so better for examples/teaching. But of course, you're right that this question is about Postgres.) I'm leery of hybrid models (and the automatic query bias of hierarchical data models), and my point about integrity still stands, I think. I have experienced web shops where comparing products side by side doesn't work properly, because they have been described with fields that supposedly mean the same, but are spelled differently. — Jon Heggland, Feb 22 '19 at 09:27

Database design for site with many product categories

3 Answers3