2

I am not asking about indexing or partitioning; I am asking about a choice between adding a large number of columns or adding the data as rows instead.

Explanation: we currently have a request to design a way to handle multiple properties and their values for certain products. Products might reach 100 million records, and each product might have multiple properties, so a ProductProperties table might reach billions of rows.

Some people suggested adding the properties as columns in the ProductProperties table: Property1 and Value1, Property2 and Value2, etc. If a product doesn't hold a value for a property, the related fields for that property will be null. They would also add about 80-100 properties to be able to cover a wide range of properties dynamically. The architect refused this approach, as it isn't a good design.

Can anyone tell me how to reach a good design with good performance? Thanks
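To make the rejected proposal concrete, it would be roughly this shape (the column names here are only illustrative, not our actual schema):

```sql
-- Rough shape of the proposed wide table (illustrative names, not the real schema)
CREATE TABLE ProductProperties (
    ProductID  BIGINT       NOT NULL PRIMARY KEY,
    Property1  VARCHAR(100) NULL,
    Value1     VARCHAR(255) NULL,
    Property2  VARCHAR(100) NULL,
    Value2     VARCHAR(255) NULL
    -- ... and so on up to Property80/Value80 or more, with NULLs
    -- wherever a product has no value for that slot
);
```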

Hossam
  • 21
  • 2
  • Given N possible properties across all products and M possible properties for a given product, general database design would indicate that rows are the logical choice, since properties can change in number as time passes and rows don't require structural changes over time. – xQbert Apr 15 '12 at 16:05
  • @Hossam - You might want to consider asking questions like this on [dba.se](http://dba.stackexchange.com/) [(it's not just for database administrators)](http://dba.stackexchange.com/faq) and flagging this for mods to migrate. Questions like this tend to get lost in the noise on SO and often get incorrect answers. – ConcernedOfTunbridgeWells Apr 15 '12 at 18:20

3 Answers

5

This problem turns up in a number of guises. In your case, it seems that you have a range of products, each of which may have a different set of properties. I think you need an extensible way to store these properties so that you can add new products to the system.

Approach 1: Generic fields on the row + supplementary metadata

The first approach you've suggested could be modified slightly by normalising the product property metadata into its own table:

  • Build your product table with some generic fields (Code1, Code2, IntVal1, IntVal2, FloatVal1 ...)

  • Build a supplementary set of parent-child reference tables, ProductType and ProductAttribute (or some such), that describe which columns on your product table contain which attributes.

  • Build functionality to interpret this into the data access layer of your application.
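A minimal sketch of that shape, with purely illustrative table and column names, might look like this:

```sql
-- Product rows carry typed generic slots; the metadata tables say what each slot means
CREATE TABLE Product (
    ProductID      BIGINT       NOT NULL PRIMARY KEY,
    ProductTypeID  INT          NOT NULL,        -- which layout of generic fields applies
    Name           VARCHAR(200) NOT NULL,
    Code1          VARCHAR(50)  NULL,
    Code2          VARCHAR(50)  NULL,
    IntVal1        INT          NULL,
    IntVal2        INT          NULL,
    FloatVal1      FLOAT        NULL
    -- ... further generic slots as needed
);

CREATE TABLE ProductType (
    ProductTypeID  INT          NOT NULL PRIMARY KEY,
    Description    VARCHAR(200) NOT NULL
);

-- Maps a logical attribute of each product type onto a physical column of Product
CREATE TABLE ProductAttribute (
    ProductTypeID  INT          NOT NULL REFERENCES ProductType (ProductTypeID),
    AttributeName  VARCHAR(100) NOT NULL,
    ColumnName     VARCHAR(30)  NOT NULL,        -- e.g. 'IntVal1', 'FloatVal1'
    PRIMARY KEY (ProductTypeID, AttributeName)
);
```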

The principal advantage of this is that the structure is efficient to query. The downside is that the content of the product table is opaque without the supplementary metadata. However, the inefficiencies and complexities of the other approaches usually radically outweigh this disadvantage.

If the number of different product types is relatively small, you could also use the metadata to generate a view or series of views over the product table that interpret the metadata. This mitigates a lot of the problems with opacity.
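For instance, a metadata-generated view for one hypothetical product type might just alias the generic slots back to meaningful names (the type and attribute names here are invented for illustration):

```sql
-- Hypothetical view generated from the metadata for one product type
CREATE VIEW TyreProducts AS
SELECT  ProductID,
        Name,
        Code1     AS SpeedRating,
        IntVal1   AS RimDiameterInches,
        FloatVal1 AS WidthMillimetres
FROM    Product
WHERE   ProductTypeID = 3;       -- illustrative type id for 'Tyre'
```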

Another advantage is that queries with multiple filter criteria on a product do not have to do multiple joins against a very large child table. If the individual fields on the table are nullable, there is relatively little overhead (typically one byte per column depending on the platform) for each field. Unused fields will waste a small amount of space on the record.

Approach 2: Entity-attribute-value

This is often proposed as the solution to this class of problem. In this case you have Product and ProductAttribute tables in a parent-child relationship with some reference data that filters product attribute types against product types.

This approach seems conceptually elegant and is extensible, but it is fiddly and inefficient to query and occupies considerably more disk space. Some database design hacks can be used on various platforms to mitigate the performance issues. You haven't specified which DBMS platform you're using, so it's hard to point you in the right direction for this. The key advantages and disadvantages of EAV structures are:

  • Infinitely flexible without having to change the database schema (+)

  • Inefficient and fiddly to query, particularly if you want to filter by multiple attributes (-)

  • More disk space usage (-)
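To make the querying point concrete, here is a minimal EAV sketch (names are illustrative, and it assumes a Product parent table like the one sketched above); note that filtering on just two attributes already needs two joins against the very large child table:

```sql
-- Minimal EAV child table (assumes a Product parent table as sketched earlier)
CREATE TABLE ProductAttributeValue (
    ProductID    BIGINT       NOT NULL REFERENCES Product (ProductID),
    AttributeID  INT          NOT NULL,
    Value        VARCHAR(255) NULL,
    PRIMARY KEY (ProductID, AttributeID)
);

-- Filtering on two attributes means two joins against the billion-row child table
SELECT  p.ProductID, p.Name
FROM    Product p
JOIN    ProductAttributeValue colour
        ON  colour.ProductID = p.ProductID
        AND colour.AttributeID = 1               -- illustrative id for 'Colour'
        AND colour.Value = 'Red'
JOIN    ProductAttributeValue sz
        ON  sz.ProductID = p.ProductID
        AND sz.AttributeID = 2                   -- illustrative id for 'Size'
        AND sz.Value = 'XL';
```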

Generally EAV structures are not recommended unless you have a compelling requirement.

Approach 3: XML fields

To paraphrase Fredrik Lundh: 'now you have two problems'. XML fields are infinitely extensible - you can put anything you want into them, but they are opaque to anything but your application, and they are slow and fiddly to query. Getting data out of an XML field in a SQL query is much more work than with data stored in columns.
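For example, on SQL Server (assuming a hypothetical xml column called Properties on the product table), extracting and filtering on a single property looks like this, which is noticeably clunkier than selecting a plain column:

```sql
-- SQL Server sketch: pulling one property out of a hypothetical xml column
SELECT  ProductID,
        Properties.value('(/Properties/Colour)[1]', 'varchar(50)') AS Colour
FROM    Product
WHERE   Properties.exist('/Properties/Colour[. = "Red"]') = 1;
```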

Generally it is a bad idea to use XML fields in the database to store something that is not inherently an XML document. Many people have written about the un-wisdom of abusing XML fields in a database. My personal experience of building ETL processes to extract data from XML fields leads me to agree. Best avoided unless you have a compelling reason.

Conclusion

Approach 1 is similar to what you originally proposed, but moves the column metadata out into its own structure. Even though it doesn't look elegant, it's the best way to go in almost all cases.

ConcernedOfTunbridgeWells
  • 64,444
  • 15
  • 143
  • 197
  • Is it really efficient to have a table with more than 80 product properties? Did the 1st approach suggest that, or did I misunderstand your explanation? Right now I'm having a hard time coping with the fact that more than 80 properties are needed. Maybe the products should be organized into groups, with a few extra tables added to the DB, so that product group A uses properties from one table and group B from another. – ZZ-bb Apr 15 '12 at 17:37
  • 1
    @ZZ-bb If you have 80 nullable columns then the overhead is normally just 80 bits or 80 bytes per row, depending on the physical implementation. If you collapse this to a generic column set with external metadata this will be smaller still. An EAV structure requires you to join a large child table against the parent multiple times to get all the attributes, and complex searches on this type of structure can be quite inefficient. – ConcernedOfTunbridgeWells Apr 15 '12 at 17:44
  • Thanks for the info. Hopefully @Hossam can say whether product grouping would help to further minimize the null fields. If you have millions of products, it's hard to imagine that grouping/normalizing is not an option. I hope Hossam doesn't have a product table where there are hundreds of nails and the only thing different about them is how long or thick they are (but every one of them is a unique item)... – ZZ-bb Apr 15 '12 at 17:53
0

I would create two tables: Product and ProductProperties.

Product would contain the basic properties of a single product: the kind of stuff that is needed and common between items, such as name, weight, selling_quantity, etc.

ProductProperties would contain everything else. Normalize the attributes of properties, name them, and create your table. All you need is a FK to Product and you're ready to go. A 1:n relationship between the tables is far better than having a single table with 80 or more properties when most of the properties are empty (I doubt every product needs 80–100 properties, but I don't know what kind of products you are listing).
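A minimal sketch of what I mean (the names are just placeholders):

```sql
CREATE TABLE Product (
    ProductID        BIGINT        NOT NULL PRIMARY KEY,
    Name             VARCHAR(200)  NOT NULL,
    Weight           DECIMAL(10,3) NULL,
    SellingQuantity  INT           NULL
    -- ... other fields common to every product
);

CREATE TABLE ProductProperties (
    ProductID  BIGINT       NOT NULL REFERENCES Product (ProductID),
    Name       VARCHAR(100) NOT NULL,
    Value      VARCHAR(255) NULL,
    PRIMARY KEY (ProductID, Name)
);
```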

I don't have any first hand experience in using billions of rows, but databases should be normalized, not filled with empty columns. This answer seems to support my thoughts: Optimal database structure - 'wider' table with empty fields or greater number of tables?

I think your first problem arises when your ProductProperties table has more rows than an unsigned bigint can handle. That could take a while, I hope...

ZZ-bb
  • 2,157
  • 1
  • 24
  • 33
0

The existing answers are correct and very good. Here is a new thought: Clearly, splitting the design into two tables (Products, ProductAttributeValues) is the most normalized and correct way to do this.

But performance can trump architectural purity. The only design goal that counts is to reduce the cost of the total solution to a minimum. Nothing else counts. If a denormalized schema improves performance enough that you can save performance effort elsewhere, or that it reduces hardware cost, then it is the right thing to do. Only TCO matters. It's that simple.

Denormalize if it saves you work even in the long term, or if it saves on hardware.

usr
  • 168,620
  • 35
  • 240
  • 369
  • Overall I agree, but how you determine TCO is subjective. Today, with the requirements known, perhaps having de-normalized data reduces TCO. But in 9 months a requirement comes along to add another 20, and the TCO of that solution ends up much higher than it would have if we had started with normalized data... Do you plan for the future or don't you? Is the ROI about what you know NOW or what you anticipate in the future? But I digress; as @ConcernedOfTunbridgeWells said, this is better left to other discussions. – xQbert Apr 15 '12 at 21:17
  • You optimize the expected TCO over an infinite future as well as you can foresee it. And this is where we drift off into subjectivity... There is just no hard argument to be made in favor of either solution. Did you expect someone to answer "always do X"? The answer is: it depends. You need to estimate what you expect to happen. – usr Apr 15 '12 at 21:19