142

It is safe to say that the EAV/CR database model is bad. That said,

Question: What database model, technique, or pattern should be used to deal with "classes" of attributes describing e-commerce products which can be changed at run time?

In a good e-commerce database, you will store classes of options (like TV resolution, then have a resolution value for each TV; the next product may not be a TV and won't have a "TV resolution"). How do you store them, search efficiently, and allow your users to set up product types with variable fields describing their products? If the search engine finds that customers typically search for TVs based on console depth, you could add console depth to your fields, then add a single depth value for each TV product at run time.

There is a nice common feature among good e-commerce apps where they show a set of products, then have "drill down" side menus where you can see "TV Resolution" as a header, and the top five most common TV Resolutions for the found set. You click one and it only shows TVs of that resolution, allowing you to further drill down by selecting other categories on the side menu. These options would be the dynamic product attributes added at run time.
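
For illustration, the drill-down counts could come from a single query against a hypothetical EAV-style attribute table (the table and column names below are placeholders, not an existing schema):

-- Hypothetical tables: product(id, category_id, ...) and
-- product_attribute(product_id, attribute_name, attribute_value).
-- Top five TV resolutions within the current found set (here: one category):
SELECT pa.attribute_value AS resolution,
       COUNT(*)           AS product_count
  FROM product p
  JOIN product_attribute pa
    ON pa.product_id = p.id
   AND pa.attribute_name = 'tv_resolution'
 WHERE p.category_id = 42   -- whatever defines the current found set
 GROUP BY pa.attribute_value
 ORDER BY product_count DESC
 LIMIT 5;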

Further discussion:

So, long story short, are there any links out on the Internet or model descriptions that could "academically" fix the following setup? I thank Noel Kennedy for suggesting a category table, but the need may be greater than that. I describe it in a different way below, trying to highlight the significance. I may need a viewpoint correction to solve the problem, or I may need to go deeper into EAV/CR.

I love the positive response to the EAV/CR model. My fellow developers all say what Jeffrey Kemp touched on below: "new entities must be modelled and designed by a professional" (taken out of context, read his response below). The problem is:

  • entities add and remove attributes weekly
    (search keywords dictate future attributes)
  • new entities arrive weekly
    (products are assembled from parts)
  • old entities go away weekly
    (archived, less popular, seasonal)

The customer wants to add attributes to the products for two reasons:

  • department / keyword search / comparison chart between like products
  • consumer product configuration before checkout

The attributes must have significance; this is not just a keyword search. If they want to compare all cakes that have a "whipped cream frosting", they can click cakes, click birthday theme, click whipped cream frosting, then check all the cakes that look interesting, knowing they all have whipped cream frosting. This is not specific to cakes, just an example.
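
Continuing the illustration, each facet a shopper clicks could become one more predicate against the same hypothetical product_attribute table (again, assumed names only):

-- All cakes that have both a 'birthday' theme and 'whipped cream' frosting
-- (one EXISTS per selected facet).
SELECT p.*
  FROM product p
 WHERE p.category_id = 7   -- cakes
   AND EXISTS (SELECT 1
                 FROM product_attribute pa
                WHERE pa.product_id = p.id
                  AND pa.attribute_name = 'theme'
                  AND pa.attribute_value = 'birthday')
   AND EXISTS (SELECT 1
                 FROM product_attribute pa
                WHERE pa.product_id = p.id
                  AND pa.attribute_name = 'frosting'
                  AND pa.attribute_value = 'whipped cream');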

Cœur
Zachary Scott
  • Why can't you just have a 'category' table with a foreign key referring to itself? – Noel Kennedy May 15 '09 at 21:31
  • It's not safe, nor precise, to say that the EAV database model is bad, because it is well suited to some applications. – spencer7593 May 15 '09 at 23:16
  • What if you decorate various objects with various properties, inheriting from a parent like in Entity Framework 4? How does it persist those objects? – Zachary Scott Jul 27 '11 at 02:26
  • Just coming back to point to this excellent article about one consultant's experience with a system based on an *extreme* version of EAV. Read it! https://www.simple-talk.com/opinion/opinion-pieces/bad-carma/ – Jeffrey Kemp Jun 13 '13 at 01:51
  • EAV is a very viable database model. I am working on a problem similar to yours and the solution is EAV. I would recommend the following article: http://sqlblog.com/blogs/aaron_bertrand/archive/2009/11/19/what-is-so-bad-about-eav-anyway.aspx –  Oct 23 '14 at 21:24
  • Can anyone elaborate on search query performance? – Adelin Aug 24 '21 at 11:54

10 Answers

78

There are a few general pros and cons I can think of; there are situations where one option is better than the other:

Option 1, EAV Model:

  • Pro: less time to design and develop a simple application
  • Pro: new entities easy to add (might even be added by users?)
  • Pro: "generic" interface components
  • Con: complex code required to validate simple data types
  • Con: much more complex SQL for simple reports
  • Con: complex reports can become almost impossible
  • Con: poor performance for large data sets

Option 2, Modelling each entity separately:

  • Con: more time required to gather requirements and design
  • Con: new entities must be modelled and designed by a professional
  • Con: custom interface components for each entity
  • Pro: data type constraints and validation simple to implement
  • Pro: SQL is easy to write, easy to understand and debug
  • Pro: even the most complex reports are relatively simple
  • Pro: best performance for large data sets

Option 3, Combination (model entities "properly", but add "extensions" for custom attributes for some/all entities)

  • Pro/Con: more time required to gather requirements and design than option 1 but perhaps not as much as option 2 *
  • Con: new entities must be modelled and designed by a professional
  • Pro: new attributes might be easily added later on
  • Con: complex code required to validate simple data types (for the custom attributes)
  • Con: custom interface components still required, but generic interface components may be possible for the custom attributes
  • Con: SQL becomes complex as soon as any custom attribute is included in a report
  • Con: good performance generally, unless you start needing to search by or report on the custom attributes

* I'm not sure if Option 3 would necessarily save any time in the design phase.
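
For illustration, Option 3 often ends up looking something like this (a sketch only; the table and column names are made up, not a prescribed schema): properly modelled core tables plus one narrow extension table for the run-time attributes.

-- Core entity modelled normally...
CREATE TABLE product (
    id    INTEGER       PRIMARY KEY,
    sku   VARCHAR(50)   NOT NULL,
    name  VARCHAR(200)  NOT NULL,
    price DECIMAL(10,2) NOT NULL
);

-- ...plus an extension table for the custom, run-time attributes.
CREATE TABLE product_custom_attribute (
    product_id      INTEGER      NOT NULL REFERENCES product (id),
    attribute_name  VARCHAR(100) NOT NULL,
    attribute_value VARCHAR(4000),
    PRIMARY KEY (product_id, attribute_name)
);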

Personally I would lean toward option 2, and avoid EAV wherever possible. However, for some scenarios the users need the flexibility that comes with EAV; but it comes at a great cost.

Jeffrey Kemp
  • What if you had a single table with indexes for text values 1-n, then in C# (in ram) map what you want to what you need. It would still work like an EAV, but the "matches" would be domain models. Sort of like a serialization, but you could use SQL selects on indexed text fields. No multiple selects per record. All the "cost" happens in RAM. – Zachary Scott Jun 15 '10 at 03:06
  • @Zim, that sounds pretty much like option 3. Each row has 1-n extra "generic" columns, and the data stored in them is interpreted at the application level. You get the performance benefit of having all the data for one record in one place. The metadata about those columns needs to be stored somewhere, however, and this is where the cost creeps in. Sure, we can cache the metadata in RAM, but it still costs more than having the domain modelled directly in application code. Certainly better than a full-fledged EAV model though! – Jeffrey Kemp Jun 15 '10 at 11:52
  • +10000 Great answer. Nowadays people skimp on database design and requirement gathering. They'd rather write a hundred times more lines of code than take the time to make a good design. – Tulains Córdova Aug 17 '14 at 03:56
  • You don't need *more* design for the relational option (2) than the EAV option (1) if you're only supplying the structure of option 1. And the relational interface is generic from metadata describing that structure. This removes all the option 2 Cons. However you forgot the only actual Con: DDL can be too slow managing tables. – philipxy Aug 29 '15 at 11:44
  • Hi @philipxy, I didn't say "more design". The raison d'être for the EAV is that (presumably) the system designer can spend *less time* on designing the model, leaving this design work to "users" later on (this lack of professional design leads to the Cons listed for Option 1). If the EAV leads to no savings for the designer that only adds more fuel to the fire for rejecting the EAV out of hand. Also, I disagree that DDL is "too slow" - since it should only be required rarely (i.e. to fix errors in the model, or to implement new features), its performance should be relatively unimportant. – Jeffrey Kemp Aug 31 '15 at 01:19
  • Hi. My point re design is, that raison d'etre is a myth: When there is no up-front design *there is no SQL*, and when design *does* happen the EAV is complex & without DBMS support while the SQL is DDL+DML. My point re DDL is, [the *actual* only reason to use EAV is if DDL is too slow](https://stackoverflow.com/a/32285603/3404097). DDL manipulates DBMS metadata tables exactly as DML manipulates an EAV table, except if you want any DBMS functionality on the table represented by the metadata in the EAV case other than querying it after reconstructing it then you are rolling your own DBMS. – philipxy Oct 26 '18 at 02:01
65

"It is safe to say that the EAV/CR database model is bad."

No, it's not. It's just an inefficient use of relational databases. A pure key/value store works great with this model.

Now, to your real question: How to store various attributes and keep them searchable?

Just use EAV. In your case it would be a single extra table. Index it on both attribute name and value; most RDBMSs will apply prefix compression to the repeated attribute names, making it really fast and compact.
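
As a rough sketch (assuming a product table already exists; all names here are placeholders), that single extra table and its indexes could look like:

-- One row per (product, attribute); searchable via the (name, value) index.
CREATE TABLE product_attribute (
    product_id      INTEGER      NOT NULL REFERENCES product (id),
    attribute_name  VARCHAR(100) NOT NULL,
    attribute_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (product_id, attribute_name)
);

CREATE INDEX ix_product_attribute_name_value
    ON product_attribute (attribute_name, attribute_value);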

EAV/CR gets ugly when you use it to replace 'real' fields. As with every tool, overusing it is 'bad', and gives it a bad image.

Javier
  • So the question is: I have 15 additional fields for one of my categories, and in the EAV model that requires 16 joins plus the main table, so 16 left joins for searching products (and 16 WHERE clauses if the customer wants them) across 3-4 million records (a website where people sell second-hand products). Doesn't that make performance low? – babak faghihian Oct 28 '15 at 14:15
  • If these "additional fields" are already defined, then it would definitely be best done as "real fields". And of course, doing an unbound number of joins in a large query would be a heavy toll (but might still be ok!). What I've done on a metadata-heavy project is to allow any number of "tags" (as EAV records) per "main item", but the "large query" picks only some predefined tagnames, keeping the total number of joins limited (currently typical is just 4 tags and around 5 other joins), and when the user selects a specific item, _then_ it fetches everything related, but for a single item. – Javier Oct 28 '15 at 19:43
  • but of course, that specific system is currently being ported to an `hstore` field (just one of the reasons why we use PostgreSQL) – Javier Oct 28 '15 at 19:47
16
// At this point, I'd like to take a moment to speak to you about the Magento/Adobe PSD format.
// Magento/PSD is not a good ecommerce platform/format. Magento/PSD is not even a bad ecommerce platform/format. Calling it such would be an
// insult to other bad ecommerce platform/formats, such as Zencart or OsCommerce. No, Magento/PSD is an abysmal ecommerce platform/format. Having
// worked on this code for several weeks now, my hate for Magento/PSD has grown to a raging fire
// that burns with the fierce passion of a million suns.

http://code.google.com/p/xee/source/browse/trunk/XeePhotoshopLoader.m?spec=svn28&r=11#107

The internal models are wacky at best, like someone put the schema into a Boggle game, sealed it up and put it in a paint shaker...

Real world: I'm working on a midware fulfilment app and here is one of the queries to get address information.

CREATE OR REPLACE VIEW sales_flat_addresses AS
SELECT sales_order_entity.parent_id AS order_id, 
       sales_order_entity.entity_id, 
       CONCAT(CONCAT(UCASE(MID(sales_order_entity_varchar.value,1,1)),MID(sales_order_entity_varchar.value,2)), "Address") as type, 
       GROUP_CONCAT( 
         CONCAT( eav_attribute.attribute_code," ::::: ", sales_order_entity_varchar.value )
         ORDER BY sales_order_entity_varchar.value DESC
         SEPARATOR '!!!!!' 
       ) as data
  FROM sales_order_entity
       INNER JOIN sales_order_entity_varchar ON sales_order_entity_varchar.entity_id = sales_order_entity.entity_id
       INNER JOIN eav_attribute ON eav_attribute.attribute_id = sales_order_entity_varchar.attribute_id
   AND sales_order_entity.entity_type_id =12
 GROUP BY sales_order_entity.entity_id
 ORDER BY eav_attribute.attribute_code = 'address_type'

Extracts address information for an order, lazily.

--

Summary: Only use Magento if:

  1. You are being given large sacks of money
  2. You must
  3. You enjoy pain
Vee
  • This is an older post but I wish I had found this 3 months ago when I started a Magento project for a client. +1 for the boggle/paint-shaker analogy! – trevorc Jan 12 '11 at 21:52
  • Pretty interesting; Magento seems like it's the king of the road in terms of e-commerce systems. Maybe it's only that its marketing is very good. – Herr Aug 17 '11 at 09:22
  • Magento isn't popular because of its maintainability, but because of its customizability, allowing anyone to implement new features without architecture changes or with few modifications. This flexibility comes with a cost. – Diego Mendes Jul 27 '15 at 03:03
  • Stay away from Magento 2 if you want to avoid Triple Pain and More pain on top for both FE and BE – TheBlackBenzKid Aug 21 '17 at 07:45
15

I'm surprised nobody mentioned NoSQL databases.

I've never used NoSQL in a production context (I've just tested MongoDB and was impressed), but the whole point of NoSQL is being able to save items with varying attributes in the same "document".

Lucas T
13

Where performance is not a major requirement, as in an ETL type of application, EAV has another distinct advantage: differential saves.

I've implemented a number of applications where an overarching requirement was the ability to see the history of a domain object from its first "version" to its current state. If that domain object has a large number of attributes, that means each change requires a new row to be inserted into its corresponding table (not an update, because the history would be lost, but an insert). Let's say this domain object is a Person, and I have 500k Persons to track, with an average of 100+ changes to various attributes over a Person's life-cycle. Couple that with the fact that rare is the application that has only one major domain object, and you'll quickly surmise that the size of the database would quickly grow out of control.

An easy solution is to save only the differential changes to the major domain objects rather than repeatedly saving redundant information.
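
A sketch of that idea (hypothetical names; the reconstruction query uses PostgreSQL's DISTINCT ON for brevity):

-- Only the attribute that changed gets a new row.
CREATE TABLE person_attribute_history (
    person_id       INTEGER      NOT NULL,
    attribute_name  VARCHAR(100) NOT NULL,
    attribute_value VARCHAR(4000),
    changed_at      TIMESTAMP    NOT NULL,
    PRIMARY KEY (person_id, attribute_name, changed_at)
);

-- State of person 123 as of a given moment: latest value per attribute.
SELECT DISTINCT ON (attribute_name)
       attribute_name, attribute_value
  FROM person_attribute_history
 WHERE person_id = 123
   AND changed_at <= TIMESTAMP '2009-05-15 00:00:00'
 ORDER BY attribute_name, changed_at DESC;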

All models change over time to reflect new business needs. Period. Using EAV is but one of the tools in our box to use; but it should never be automatically classified as "bad".

Jerry Jasperson
  • +1 for "Using EAV is but one of the tools in our box to use; but it should never be automatically classified as "bad"." – Catchops Jul 26 '16 at 20:42
  • Btw, this is called SCD (slowly changing dimensions). Also bitemporal requirements (a specific case of Type 4 SCD) call for EAV schema for the attributes that have this property. Remember, 99% of NoSQL has no native joins, so if you need "live" joins with this type of data, EAV is the only way to go. – cowbert Jul 12 '17 at 04:27
3

I'm struggling with the same issue. It may be interesting for you to check out the following discussion on two existing ecommerce solutions: Magento (EAV) and Joomla (regular relational structure): https://forum.virtuemart.net/index.php?topic=58686.0

It seems that Magento's EAV performance is a real showstopper.

That's why I'm leaning towards a normalized structure. To overcome the lack of flexibility, I'm thinking about adding a separate data dictionary in the future (XML or separate DB tables) that could be edited; based on it, application code for displaying and comparing product categories with new attribute sets would be generated, together with SQL scripts.

Such an architecture seems to be the sweet spot in this case: flexible and performant at the same time.

The problem could be frequent use of ALTER TABLE in a live environment. I'm using Postgres, so its MVCC and transactional DDL will hopefully ease the pain.
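
As a sketch of the idea (all names are made up for illustration): a small dictionary table drives the generator, which then emits ordinary DDL such as the ALTER TABLE below.

-- Editable data dictionary the generator reads.
CREATE TABLE attribute_definition (
    category_code VARCHAR(50) NOT NULL,  -- e.g. 'tv'
    column_name   VARCHAR(63) NOT NULL,  -- e.g. 'console_depth_cm'
    data_type     VARCHAR(30) NOT NULL,  -- e.g. 'numeric(6,2)'
    PRIMARY KEY (category_code, column_name)
);

-- The kind of statement the generator would emit for a new dictionary row;
-- in Postgres this can run inside a transaction (transactional DDL).
ALTER TABLE product_tv ADD COLUMN console_depth_cm NUMERIC(6,2);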

aaimnr
3

If it's just about product catalog attributes, and hence the validation requirements for those attributes are rather limited, the only real downside to EAV is query performance, and even that is only a problem when your query deals with multiple "things" (products) and their attributes. The performance of the query "give me all attributes for the product with id 234", while not optimal, is still plenty fast.

One solution is to use the SQL database / EAV model only for the admin / edit side of the product catalog, and have some process that denormalizes the products into something searchable. Since you already have attributes, and hence it's rather likely that you want faceting, that something could be Solr or ElasticSearch. This approach avoids basically all the downsides of the EAV model, and the added complexity is limited to serializing a complete product to JSON on update.
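
For example, the serialization step could be as simple as this sketch (PostgreSQL syntax, assumed table names), with the resulting JSON document pushed to the search engine:

-- Collapse one product's EAV rows into a single JSON document.
SELECT p.id,
       json_object_agg(pa.attribute_name, pa.attribute_value) AS attributes
  FROM product p
  JOIN product_attribute pa ON pa.product_id = p.id
 WHERE p.id = 234
 GROUP BY p.id;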

bob
2

I still vote for modeling at the lowest meaningful atomic level for EAV. Let the standards, technologies and applications geared toward a certain user community decide the content models, repetition needs of attributes, grain, etc.

Amanda Xu
2

EAV has many drawbacks:

  1. Performance degradation over time. Once the amount of data in the application grows beyond a certain size, retrieving and manipulating it is likely to become less and less efficient.
  2. The SQL queries are very complex and difficult to write (see the sketch after this list).
  3. Data integrity problems. You can't define foreign keys for all the fields needed.
  4. You have to define and maintain your own metadata.
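
To illustrate point 2: even a trivial report that shows two attributes side by side needs one join per attribute (hypothetical EAV table and names):

-- One LEFT JOIN per attribute just to lay the values out in columns.
SELECT p.id,
       pa_colour.attribute_value AS colour,
       pa_size.attribute_value   AS size
  FROM product p
  LEFT JOIN product_attribute pa_colour
         ON pa_colour.product_id = p.id
        AND pa_colour.attribute_name = 'colour'
  LEFT JOIN product_attribute pa_size
         ON pa_size.product_id = p.id
        AND pa_size.attribute_name = 'size';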
herry
  • 1. This is true for most relational databases too; this is why sharding was invented. 2. Data modelling can be complex and difficult to implement. I've spent weeks-months waiting for OLAP cube schema changes. 3. Already mostly done in software now 4. You have to do this "in ERwin, Excel, and Visio" when modelling a relational schema anyway. – cowbert Jul 12 '17 at 04:28
  • @cowbert: 1. Of course. EAV performance degrades faster than normalized architectures, ie. scales worse. For one example, check out this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC79043/ – Joel Mellon Jun 17 '22 at 18:24
1

I have a slightly different problem: instead of many attributes with sparse values (which is possibly a good reason to use EAV), I want to store something more like a spreadsheet. The columns in the sheet can change, but within a sheet all cells will contain data (not sparse).

I made a small set of tests to benchmark two designs: one using EAV, and the other using a Postgres ARRAY to store cell data.

EAV design: (schema diagram)

Array design: (schema diagram)

Both schemas have indexes on appropriate columns, and the indexes are used by the planner.
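
To convey the general shape of the two designs (a sketch only; the diagrams above show the actual schemas):

-- EAV: one row per cell.
CREATE TABLE sheet_cell (
    sheet_id INTEGER NOT NULL,
    row_num  INTEGER NOT NULL,
    col_num  INTEGER NOT NULL,
    value    TEXT,
    PRIMARY KEY (sheet_id, row_num, col_num)
);

-- Array: one row per sheet row, all cells held in a PostgreSQL array.
CREATE TABLE sheet_row (
    sheet_id INTEGER NOT NULL,
    row_num  INTEGER NOT NULL,
    cells    TEXT[]  NOT NULL,
    PRIMARY KEY (sheet_id, row_num)
);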

It turned out the array-based schema was an order of magnitude faster for both inserts and queries. From quick tests, it seemed that both scaled linearly. The tests aren't very thorough, though. Suggestions and forks welcome - they're under an MIT licence.

z0r
  • how did you do joins on the sheet columns (i.e. vlookup) with the array model? Don't you have to write your own array merge-sort function? Highly doubt it can be as good as the precompiled merge sort if you used sheet_id + x-coordinate+y-coordinate of a cell as the cell value's key. (to emulate excel, pregenerate a lookup table for x-coordinates where 0-18278 are columns A-ZZZ (excel maxes out at 16384)), then you can select values where sheet_id=uuid and x-coord = 0 and y-coord < 1001 to get first 1000 rows of col A. – cowbert Jul 12 '17 at 04:48
  • @cowbert you're right; actually I just load the columns I'm interested in and do the join in Python. Slack! – z0r Jul 13 '17 at 07:52