158

My requirements are:

  • Need to be able to dynamically add User-Defined fields of any data type
  • Need to be able to query UDFs quickly
  • Need to be able to do calculations on UDFs based on datatype
  • Need to be able to sort UDFs based on datatype

Other Information:

  • I'm looking for performance primarily
  • There are a few million Master records which can have UDF data attached
  • When I last checked, there were over 50mil UDF records in our current database
  • Most of the time, a UDF is only attached to a few thousand of the Master records, not all of them
  • UDFs are not joined or used as keys. They're just data used for queries or reports

Options:

  1. Create a big table with StringValue1, StringValue2... IntValue1, IntValue2,... etc. I hate this idea, but will consider it if someone can tell me it is better than other ideas and why.

  2. Create a dynamic table which adds a new column on demand as needed. I also don't like this idea since I feel performance would be slow unless you indexed every column.

  3. Create a single table containing UDFName, UDFDataType, and Value. When a new UDF gets added, generate a View which pulls just that data and parses it into whatever type is specified. Items which don't meet the parsing criteria return NULL.

  4. Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added

  5. XML DataTypes? I haven't worked with these before but have seen them mentioned. Not sure if they'd give me the results I want, especially with performance.

  6. Something else?

Rachel
  • 130,264
  • 66
  • 304
  • 490
  • 8
    Martin Fowler recommends 2 (user-updateable schema) or 5 (indexed XML LOB): http://martinfowler.com/bliki/UserDefinedField.html – Neil McGuigan Aug 06 '13 at 20:24
  • 1
    See also the StackOverflow question on [dynamic database schemas](https://stackoverflow.com/questions/66385/dynamic-database-schema). – FloverOwe Oct 08 '19 at 19:23
  • 1
    Now many DBs support JSON fields... however indexing / performance is still a problem. For example PostgreSQL (v13) supports [GIN indexes on JSON](https://www.postgresql.org/docs/current/gin-builtin-opclasses.html)... however only equality operators are supported (not range queries). MongoDB seems to have better support with [wildcard indexes](https://docs.mongodb.com/manual/core/index-wildcard/). – collimarco Nov 25 '20 at 10:59

14 Answers

53

If performance is the primary concern, I would go with #6... a table per UDF (really, this is a variant of #2). This answer is specifically tailored to this situation and the data distribution and access patterns described.
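
Below is a minimal sketch of what a per-UDF table might look like; the table and column names are invented for illustration, not taken from the question.

-- Hypothetical master table (only the columns needed for the example)
CREATE TABLE MasterRecord (
    MasterId int NOT NULL PRIMARY KEY,
    Name     varchar(200) NOT NULL
    -- ...other core columns
);

-- One narrow, strongly typed table per UDF; only master records that
-- actually have a value for this UDF get a row here
CREATE TABLE Udf_ContractEndDate (
    MasterId        int  NOT NULL PRIMARY KEY
        REFERENCES MasterRecord (MasterId),
    ContractEndDate date NOT NULL
);

CREATE INDEX IX_Udf_ContractEndDate ON Udf_ContractEndDate (ContractEndDate);

-- Typical usage: filter/sort on the UDF in its small table, then join back
SELECT m.MasterId, m.Name, u.ContractEndDate
FROM Udf_ContractEndDate AS u
JOIN MasterRecord AS m ON m.MasterId = u.MasterId
WHERE u.ContractEndDate < '20120101'
ORDER BY u.ContractEndDate;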

Pros:

  1. Because you indicate that some UDFs have values for a small portion of the overall data set, a separate table would give you the best performance because that table will be only as large as it needs to be to support the UDF. The same holds true for the related indices.

  2. You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.

  3. You can use table/column names that reflect what the data actually is.

  4. You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.

  5. Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application level constraint enforcement.

Cons:

  1. This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.

  2. There is more application code needed to handle UDF definition and management. I expect this is still less code than is needed for the original options 1, 3, & 4.

Other Considerations:

  1. If there is anything about the nature of the data that would make it sensible to group certain UDFs, that should be encouraged. That way, those data elements can be combined into a single table. For example, let's say you have UDFs for color, size, and cost. The tendency in the data is that most instances of this data look like

     'red', 'large', 45.03 
    

    rather than

     NULL, 'medium', NULL
    

    In such a case, you won't incur a noticeable speed penalty by combining the 3 columns in 1 table because few values would be NULL and you avoid making 2 more tables, which is 2 fewer joins needed when you need to access all 3 columns.

  2. If you hit a performance wall from a UDF that is heavily populated and frequently used, then that should be considered for inclusion in the master table.

  3. Logical table design can take you to a certain point, but when the record counts get truly massive, you also should start looking at what table partitioning options are provided by your RDBMS of choice.

Phil Helmer
  • 1,230
  • 8
  • 8
  • Thanks, I think I will do some variation of this. Most of our UDF data comes from unmapped import fields that need to stay around for reference purposes only, so I'd like to put those in one table. Other UDFs are defined as needed (I can't identify them in advance.. they usually are created when we change some process or decide to track something special for a few months) and are commonly used in queries. I think I'll make a separate table for each logical unit of these values. – Rachel Mar 07 '11 at 13:21
  • I'm working with a table that has dated/versioned UDFs; I use this method, http://stackoverflow.com/a/123481/328968, to get at the latest values. – Peter Jul 09 '12 at 14:59
23

I have written about this problem a lot. The most common solution is the Entity-Attribute-Value antipattern, which is similar to what you describe in your option #3. Avoid this design like the plague.

What I use for this solution when I need truly dynamic custom fields is to store them in a blob of XML, so I can add new fields at any time. But to make it speedy, also create additional tables for each field you need to search or sort on (you don't need a table per field--just a table per searchable field). This is sometimes called an inverted index design.
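
As a rough sketch of that layout (all names here are assumed for illustration): the blob column carries every custom field, and each field that needs searching or sorting also gets its own small, typed, indexed table.

CREATE TABLE Issue (
    IssueId    int NOT NULL PRIMARY KEY,
    Title      varchar(200) NOT NULL,
    CustomData xml NULL                  -- all custom fields serialized here
);

-- "Inverted index" table for one field you need to search or sort on
CREATE TABLE IssueDueDate (
    IssueId int  NOT NULL PRIMARY KEY REFERENCES Issue (IssueId),
    DueDate date NOT NULL
);

CREATE INDEX IX_IssueDueDate ON IssueDueDate (DueDate);

-- Searches and sorts hit the small typed table, then join back for display
SELECT i.IssueId, i.Title, d.DueDate
FROM IssueDueDate AS d
JOIN Issue AS i ON i.IssueId = d.IssueId
WHERE d.DueDate BETWEEN '20110101' AND '20110331'
ORDER BY d.DueDate;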

You can read an interesting article from 2009 about this solution here: http://backchannel.org/blog/friendfeed-schemaless-mysql

Or you can use a document-oriented database, where it's expected that you have custom fields per document. I'd choose Solr.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • 1
    Can you explain why I should avoid option #3? I looked at some of your examples, but they really aren't the same as what I am trying to do. I simply want a place to store extra data, not a place to store all attributes. – Rachel Mar 03 '11 at 13:21
  • 2
    For starters, how would you make an attribute NOT NULL? How would you make an attribute UNIQUE without making all attributes UNIQUE? It goes on from there. You end up writing application code to provide features that the RDBMS already provides for you, even to the point of having to write some kind of mapping class for simply inserting a logical entity record and fetching it back. – Bill Karwin Mar 03 '11 at 16:40
  • 2
    The short answer is "don't mix data and metadata." Creating varchar columns for `fieldname` or `tablename` is storing metadata identifiers as data strings, and that's the start of a lot of the problems. Also see http://en.wikipedia.org/wiki/Inner-platform_effect – Bill Karwin Mar 03 '11 at 16:51
  • Whether you store all your attributes in the "dynamic" way or just the subset of user-defined attributes, makes no difference to the level of difficulty introduced or the extra code you have to write to handle them. – Bill Karwin Mar 03 '11 at 16:53
  • 1
    As far as I can tell, this is simply another way to implement an EAV. You have the same limitations with respect to data integrity that you have with an EAV. If by "creating a table" for searching you mean writing code to auto-populate a hard table, you can do that with a standard EAV and have all the same problems. It does make it harder for someone to make the mistake of querying for a specific attribute (which shouldn't be done in either this solution or an EAV) so in that one respect it is better than a standard EAV. – Thomas Mar 04 '11 at 23:02
  • 2
    @Thomas: In the inverted index design, you can use standard schema solutions for data types, and constraints like UNIQUE and FOREIGN KEY. Those don't work at all when you use EAV. I agree inverted index shares with EAV the trait of being non-relational simply because it supports dissimilar attributes per row, but it's a point of compromise. – Bill Karwin Mar 04 '11 at 23:20
  • Reading the blog article, you are strongly against placing metadata into tables. Yet, that is precisely what SQL does. All metadata is by definition stored by the DDL commands in system tables precisely so that it can be queried. Surely the distinction is really just one of whether the optimiser has enough insight to interpret the data in a way that is appropriate. Parameterised views with typed Attributes can easily restrict the columns to a single consistent domain for querying. – Pekka Dec 08 '13 at 22:19
  • @Pekka, system tables are still tables. They have a fixed set of columns with appropriate data types, one kind of information goes into each column, etc. This is the definition of a *relation* (see many books by C. J. Date to understand further). EAV tables do not meet this definition. The "value" column contains various incomparable types of information, and the virtual table the EAV models does not have a fixed number of attributes. – Bill Karwin Dec 09 '13 at 00:51
  • You were using arguments both for and against. I realise the multiple domain problem is present in the physical schema, but nothing prevents a universal join of the attribute table from appearing like any other table in the conceptual schema. Also, a fixed number of columns is hardly a criterion - consider ALTER TABLE ADD .... – Pekka Dec 09 '13 at 06:45
  • @Pekka, after ALTER TABLE ADD COLUMN, *every* row has the same set of columns. You can change the set, but it remains fixed with respect to every row. That's fundamental to being relational. Allowing each row be able to decide its own set of columns is simply not a relational table. – Bill Karwin Dec 09 '13 at 15:19
  • Bill, I don't disagree that the conceptual schema needs to represent a relation with limited domains and I did not suggest that any row can decide on its own set of columns. I pointed out that saying that table columns cannot change is inaccurate, and that a universal join of the columns in the EAV table meets the requirements of a relation. This may not sit with your preference, but wishing it will not make it go away. It is better to accept situations and work with them. In some cases, it may not be possible to change the code, then the best option is to change the data. – Pekka Dec 09 '13 at 20:49
  • @Pekka, Please explain how to model a NOT NULL attribute in EAV. – Bill Karwin Dec 10 '13 at 05:03
  • See http://sqlfiddle.com/#!3/e9675/4 for a sample schema. Note the trigger does not actually fire and the restricted row is added. The result is correct when run on a local instance. Is there anything else that prevents the approach from being relational? – Pekka Dec 10 '13 at 19:34
  • @Pekka, it is not relational, for several reasons, including but not limited to: (a) you can't implement any constraints against the `Value` column, and (b) the physical `Value` column contains values of all attributes. A column must contain only values from one data type. Please read a book such as [SQL and Relational Theory](http://shop.oreilly.com/product/9780596523084.do) by C. J. Date if you don't believe me. Date is the foremost living authority on what it means to be relational. – Bill Karwin Dec 11 '13 at 02:13
  • Nothing prevents triggers from validating both constraints and types. However, I can see that you will not consider this beyond its surface. I'll stop now. – Pekka Dec 11 '13 at 06:06
  • I've used this method over the table-per-UDF described in Phil Helmer's answer because it was fastest to implement. The problem is that users who define custom fields almost always need them sorted and filtered in some way. This inevitably forces me back to the table-per-UDF approach. – Jonn Jul 22 '15 at 11:16
  • greetings mr karwin, thanks for your help. three questions: 1) you said '...also create additional tables...' that means same data in xml blob and the supplementary table. is this duplication of data acceptable in this scenario (thinking sync issues etc...) and 2) if some udf types are *always* present "together" e.g. 'udf_height' and 'udf_width' then is just one supplementary table for both enough (with proper indexes on both)? and finally 3) lots of tables are going to be generated through the application's lifetime, and the application will be using udf_ID values to look up *tables*. 1/2 – Sharky Nov 13 '16 at 10:53
  • ...is a) the big number of tables and b) using *names* of tables to carry information a good practice in this situation? thanks again for your contribution your answers are always a must read for me! 2/2 – Sharky Nov 13 '16 at 10:56
  • Hi @BillKarwin, I am sure you are a super experienced in dev/DBA. In 2018, would you still consider this solution optimal? – thitami Mar 11 '18 at 11:02
  • 2
    @thitami, What I've learned over the years is that *any* solution might be the right one for your app. Even EAV might be the least bad solution for some specific app. You can't choose an optimization strategy without knowing your queries. Every kind of optimization improves certain queries at the expense of other queries. – Bill Karwin Mar 11 '18 at 19:50
  • @BillKarwin Many thanks for your answer! I have decided to go with EAV too :) – thitami Mar 12 '18 at 10:54
  • 1
    @thitami, Good luck. EAV always requires a lot of work from the developer, it's highly error-prone, and it takes several times more storage space than normal columns. As long as you know what you're getting into... then good luck. – Bill Karwin Mar 12 '18 at 14:48
12

This sounds like a problem that might be better solved by a non-relational solution, like MongoDB or CouchDB.

They both allow for dynamic schema expansion while allowing you to maintain the tuple integrity you seek.

I agree with Bill Karwin: the EAV model is not a performant approach for you. Using name-value pairs in a relational system is not intrinsically bad, but only works well when the name-value pair makes a complete tuple of information. When using it forces you to dynamically reconstruct a table at run-time, all kinds of things start to get hard. Querying becomes an exercise in pivot maintenance or forces you to push the tuple reconstruction up into the object layer.

You can't determine whether a null or missing value is a valid entry or lack of entry without embedding schema rules in your object layer.

You lose the ability to efficiently manage your schema. Is a 100-character varchar the right type for the "value" field? 200 characters? Should it be nvarchar instead? It can be a hard trade-off, and one that ends with you having to place artificial limits on the dynamic nature of your set. Something like "you can only have x user-defined fields and each can only be y characters long."

With a document-oriented solution, like MongoDB or CouchDB, you maintain all attributes associated with a user within a single tuple. Since joins are not an issue, life is happy, as neither of these two does well with joins, despite the hype. Your users can define as many attributes as they want (or you will allow) at lengths that don't get hard to manage until you reach about 4MB.

If you have data that requires ACID-level integrity, you might consider splitting the solution, with the high-integrity data living in your relational database and the dynamic data living in a non-relational store.

Data Monk
  • 1,279
  • 9
  • 15
11

I would most probably create a table of the following structure:

  • varchar Name
  • varchar Type
  • decimal NumberValue
  • varchar StringValue
  • date DateValue

The exact types of course depend on your needs (and on the DBMS you are using). You could also use the NumberValue (decimal) field for ints and booleans. You may need other types as well.

You need some link to the Master records which own the value. It's probably easiest and fastest to create a user fields table for each master table and add a simple foreign key. This way you can filter master records by user fields easily and quickly.

You may want to have some kind of metadata information, so you end up with the following (a DDL sketch follows the column lists):

Table UdfMetaData

  • int id
  • varchar Name
  • varchar Type

Table MasterUdfValues

  • int Master_FK
  • int MetaData_FK
  • decimal NumberValue
  • varchar StringValue
  • date DateValue
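
A DDL sketch of those two tables (SQL Server syntax and sizes assumed; the foreign key back to the master table is only hinted at in a comment, since that table isn't defined here):

CREATE TABLE UdfMetaData (
    id   int          NOT NULL PRIMARY KEY,
    Name varchar(100) NOT NULL,
    Type varchar(20)  NOT NULL            -- e.g. 'number', 'string', 'date'
);

CREATE TABLE MasterUdfValues (
    Master_FK   int NOT NULL,             -- foreign key to the owning master record
    MetaData_FK int NOT NULL REFERENCES UdfMetaData (id),
    NumberValue decimal(18, 4) NULL,
    StringValue varchar(400)   NULL,
    DateValue   date           NULL,
    CONSTRAINT PK_MasterUdfValues PRIMARY KEY (Master_FK, MetaData_FK)
);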

Whatever you do, I would not change the table structure dynamically. It is a maintenance nightmare. I would also not use XML structures; they are much too slow.

Stefan Steinegger
  • 63,782
  • 15
  • 129
  • 193
  • I like your strategy, and maybe opt for it but in 2017, Will you opt for something different? like json – maztt Feb 28 '17 at 11:52
  • In our project, we implemented our own data structures which serialize to something similar to JSON. It features a typesafe interface to read and write data without casting and with great programming language integration. That's really great. It has the same problem as all of these kinds of "documents" in databases: it's hard to query for specific values and it cannot easily reference data outside of the "document". Depending on the usage, neither is even an issue. – Stefan Steinegger Mar 08 '17 at 15:23
  • Besides that, what I proposed in 2011 is IMHO still a valid solution. – Stefan Steinegger Mar 08 '17 at 15:25
6

Even if you provide for a user adding custom columns, it will not necessarily be the case that querying on those columns will perform well. There are many aspects that go into query design that allow them to perform well, the most important of which is the proper specification of what should be stored in the first place. Thus, fundamentally, do you want to allow users to create schema without thought as to specifications and still be able to quickly derive information from that schema? If so, then it is unlikely that any such solution will scale well, especially if you want to allow the user to do numerical analysis on the data.

Option 1

IMO this approach gives you schema with no knowledge as to what the schema means, which is a recipe for disaster and a nightmare for report designers. I.e., you must have the metadata to know what column stores what data. If that metadata gets messed up, it has the potential to hose your data. Plus, it makes it easy to put the wrong data in the wrong column. ("What? String1 contains the names of convents? I thought it was Charlie Sheen's favorite drugs.")

Option 3,4,5

IMO, requirements 2, 3, and 4 eliminate any variation of an EAV. If you need to query, sort or do calculations on this data, then an EAV is Cthulhu's dream and your development team's and DBA's nightmare. EAVs will create a bottleneck in terms of performance and will not give you the data integrity you need to quickly get to the information you want. Queries will quickly turn into crosstab Gordian knots.

Option 2,6

That really leaves one choice: gather specifications and then build out the schema.

If the client wants the best performance on data they wish to store, then they need to go through the process of working with a developer to understand their needs so that it is stored as efficiently as possible. It could still be stored in a table separate from the rest of the tables with code that dynamically builds a form based on the schema of the table. If you have a database that allows for extended properties on columns, you could even use those to help the form builder use nice labels, tooltips, etc., so that all that is necessary is to add the schema. Either way, to build and run reports efficiently, the data needs to be stored properly. If the data in question will have lots of nulls, some databases have the ability to store that type of information. For example, SQL Server 2008 has a feature called Sparse Columns specifically for data with lots of nulls.
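
For illustration only, a hypothetical sparse-column declaration with a filtered index (column names invented, not from the question):

CREATE TABLE MasterRecord (
    MasterId        int NOT NULL PRIMARY KEY,
    Name            varchar(200) NOT NULL,
    ContractEndDate date SPARSE NULL,      -- NULLs in sparse columns cost no storage
    WarrantyMonths  int  SPARSE NULL
);

-- A filtered index stays only as large as the populated subset
CREATE INDEX IX_MasterRecord_ContractEndDate
    ON MasterRecord (ContractEndDate)
    WHERE ContractEndDate IS NOT NULL;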

If this were only a bag of data on which no analysis, filtering, or sorting was to be done, I'd say some variation of an EAV might do the trick. However, given your requirements, the most efficient solution will be to get the proper specifications even if you store these new columns in separate tables and build forms dynamically off those tables.

Sparse Columns

Thomas
  • 63,911
  • 12
  • 95
  • 141
5
  1. Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added

According to my research, multiple tables based on the data type are not going to help you with performance, especially if you have bulk data, like 20K or 25K records with 50+ UDFs. Performance was the worst in that case.

You should go with a single table with multiple columns like:

  • varchar Name
  • varchar Type
  • decimal NumberValue
  • varchar StringValue
  • date DateValue
Amit Contractor
  • 169
  • 3
  • 3
5

This is a problematic situation, and none of the solutions appears "right". However option 1 is probably the best both in terms of simplicity and in terms of performance.

This is also the solution used in some commercial enterprise applications.

EDIT

Another option that is available now, but didn't exist (or at least wasn't mature) when the question was originally asked, is to use JSON fields in the DB.

Many relational DBs now support JSON-based fields (which can include a dynamic list of sub-fields) and allow querying on them:

PostgreSQL

MySQL
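
For example, a minimal PostgreSQL sketch (table and field names assumed) using a jsonb column with a GIN index:

CREATE TABLE master_record (
    id   bigserial PRIMARY KEY,
    name text NOT NULL,
    udfs jsonb NOT NULL DEFAULT '{}'::jsonb   -- dynamic user-defined fields
);

-- GIN index speeds up containment queries against the JSON blob
CREATE INDEX idx_master_record_udfs ON master_record USING gin (udfs);

-- Find records whose UDF "color" equals 'red'
SELECT id, name
FROM master_record
WHERE udfs @> '{"color": "red"}';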

Ophir Yoktan
  • 8,149
  • 7
  • 58
  • 106
  • 1
    I hate the idea of creating possibly hundreds of unused columns. It goes against what I've learned and read about SQL database design. Right now, we have over 1300 different user-defined values, although a lot of them are simply duplicates of existing items which are named differently. – Rachel Feb 24 '11 at 15:14
  • 1300 different UDF for a single table? does each user have the option to add UDF, or only some kind of power user? – Ophir Yoktan Feb 24 '11 at 15:23
  • It's part of the import process... it adds any non-mapped data to a user-defined field. Since no one takes the time to map unmapped data to existing UDF fields, it just creates new ones and over the years a lot have been added. – Rachel Feb 24 '11 at 15:28
4

Our database powers a SaaS app (helpdesk software) where users have over 7k "custom fields". We use a combined approach:

  1. (EntityID, FieldID, Value) table for searching the data
  2. a JSON field in the entities table that holds all entity values, used for displaying the data (this way you don't need a million JOINs to get the values).

You could further split #1 to have a "table per datatype" like this answer suggests; this way you can even index your UDFs.
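
A sketch of this combined layout, with assumed table and column names (an EAV table for searching plus a denormalized JSON copy for display):

CREATE TABLE Entities (
    EntityId     int NOT NULL PRIMARY KEY,
    Name         nvarchar(200) NOT NULL,
    CustomFields nvarchar(max) NULL        -- JSON copy of all custom values, read only for display
);

CREATE TABLE EntityFieldValues (
    EntityId int NOT NULL REFERENCES Entities (EntityId),
    FieldId  int NOT NULL,
    Value    nvarchar(400) NULL,
    CONSTRAINT PK_EntityFieldValues PRIMARY KEY (EntityId, FieldId)
);

CREATE INDEX IX_EntityFieldValues_Field_Value ON EntityFieldValues (FieldId, Value);

-- Search via the EAV rows, display via the JSON column, no per-field JOINs
SELECT e.EntityId, e.Name, e.CustomFields
FROM EntityFieldValues AS v
JOIN Entities AS e ON e.EntityId = v.EntityId
WHERE v.FieldId = 42 AND v.Value = N'red';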

P.S. A couple of words to defend the "Entity-Attribute-Value" approach everyone keeps bashing. We have used #1 without #2 for decades and it worked just fine. Sometimes it's a business decision: do you have time to rewrite your app and redesign the DB, or can you throw a couple of bucks at cloud servers, which are really cheap these days? By the way, when we were using the #1 approach, our DB held millions of entities, accessed by hundreds of thousands of users, and a 16GB dual-core DB server was doing just fine.

Alex from Jitbit
  • 53,710
  • 19
  • 160
  • 149
  • Hi @Alex, I came across a similar issue. If I understand well you have got: 1) a `custom_fields` table storing values such as 1=> `last_concert_year`, 2 => `band`, 3 => `music` and then a `custom_fields_values` table with values 001, 1, 1976 002, 1, 1977 003, 2, `Iron Maiden` 003, 3, `Metal` Hope the example makes sense to you and sorry for the formatting! – thitami May 03 '18 at 10:04
  • @thitami not exactly. Following your example: I have a `bands` table with a row `1,'Iron Maiden'` then `custom_fields` with rows `1,'concert_year' | 2,'music'` then `custom_fields_values` with rows `1,1,'1977'|1,2,'metal'` – Alex from Jitbit May 03 '18 at 16:40
  • @AlexfromJitbit A quick question, if you have more than million users, when creating a new Custom Field an insert trigger would have to be utilized to insert `custom_field_value` rows for each user right? – Shyamal Parikh Jun 20 '21 at 05:43
  • Also how do you make sure there's consistency between the `json` field and the `custom_field_value` rows? – Shyamal Parikh Jun 20 '21 at 05:50
  • @ShyamalParikh we did not use triggers, the consistency was managed on the application side. When we added a new custom field - nothing happens, unless you SET the field for a specific user. When we DELETE'd a field - also nothing happens, we just verify the field exists when displaying JSON on the app side. – Alex from Jitbit Jun 20 '21 at 16:54
2

I've had experience of 1, 3 and 4, and they all end up either messy, with it not being clear what the data is, or really complicated, with some sort of soft categorisation to break the data down into dynamic types of record.

I'd be tempted to try XML: you should be able to enforce schemas against the contents of the XML to check data typing etc., which will help with holding different sets of UDF data. In newer versions of SQL Server you can index XML fields, which should help out on the performance (see http://blogs.technet.com/b/josebda/archive/2009/03/23/sql-server-2008-xml-indexing.aspx for example).
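
A hypothetical SQL Server sketch of an XML column with a primary XML index (names made up; an XML schema collection could additionally be bound to the column to enforce typing):

CREATE TABLE MasterUdfXml (
    MasterId int NOT NULL PRIMARY KEY,
    UdfData  xml NULL                      -- user-defined fields as an XML document
);

-- The primary XML index shreds the XML so queries don't reparse the blob each time
CREATE PRIMARY XML INDEX PXML_MasterUdfXml ON MasterUdfXml (UdfData);

-- Pull a single typed value out of the XML
SELECT MasterId,
       UdfData.value('(/udfs/udf[@name="ContractEndDate"])[1]', 'date') AS ContractEndDate
FROM MasterUdfXml;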

Jon Egerton
  • 40,401
  • 11
  • 97
  • 129
  • Honestly, I haven't looked into XML at all. The main downside to that is I'd have to learn how it worked and how to query against it, and I've heard performance can be worse than the other options – Rachel Feb 24 '11 at 15:12
  • 1
    I'd avoid using xml for this: it can do the job, and I've implemented something like this in xml in the past, but performance got quite bad as the data structures grew, and the code complexity was high. – Kell Mar 04 '11 at 09:47
2

If you're using SQL Server, don't overlook the sql_variant type. It's pretty fast and should do the job. Other databases might have something similar.

XML datatypes are not so good for performance reasons. If you're doing calculations on the server then you're constantly having to deserialize these.

Option 1 sounds bad and looks cruddy, but performance-wise can be your best bet. I have created tables with columns named Field00-Field99 before because you just can't beat the performance. You might need to consider your INSERT performance too, in which case this is also the one to go for. You can always create Views on this table if you want it to look neat!
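
As a hypothetical sketch (not taken from the answer), a sql_variant value column and how its base type can be inspected:

CREATE TABLE UdfValues (
    MasterId int          NOT NULL,
    UdfName  varchar(100) NOT NULL,
    Value    sql_variant  NULL,
    CONSTRAINT PK_UdfValues PRIMARY KEY (MasterId, UdfName)
);

-- Inspect the underlying type of each stored value
SELECT MasterId, UdfName,
       SQL_VARIANT_PROPERTY(Value, 'BaseType') AS BaseType
FROM UdfValues;

-- Numeric filter: sql_variant comparisons rank by type family first,
-- so string values stored under the same name won't satisfy this predicate
SELECT MasterId, Value
FROM UdfValues
WHERE UdfName = 'Weight'
  AND Value > CAST(100 AS decimal(18, 2));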

Tim Rogers
  • 21,297
  • 6
  • 52
  • 68
  • Thanks, I'll take another look at SQL variants. My biggest concern is performance and I'm not sure how it would handle that, especially if we're talking about over 50mil rows – Rachel Feb 24 '11 at 15:21
  • Just found out sql_variants cannot be used with the LIKE clause... that's a huge downside to me. Of course, if I do create a view for each UDF then I could cast it to the appropriate datatype based on SQL_VARIANT_PROPERTY(value, 'BaseType')... still, seems like it's bad for performance – Rachel Feb 24 '11 at 15:52
  • You can use LIKE, but you have to cast the value first. LIKE only works on varchars so you have to cast your sql_variant to a varchar. As long as you know whether your UDF is a varchar (e.g. because the type is stored somewhere else) you can filter all your rows to varchars then cast and run your LIKE query: e.g. select * FROM MyTable where variant_type = 'v' AND Cast(variant_value as varchar(max)) LIKE 'Blah%'. This way, you're not converting ints and so on to strings which would slow you down. – Tim Rogers Feb 24 '11 at 16:34
  • I'd need to run some tests to see how performance is on that, especially with millions of rows. Know of any online articles about performance using sql_variants? Especially with casting and a very large number of records? – Rachel Feb 24 '11 at 17:24
1

SharePoint uses option 1 and has reasonable performance.

Nathan DeWitt
  • 6,511
  • 8
  • 46
  • 66
1

I've managed this very successfully in the past using none of these options (option 6? :) ).

I create a model for the users to play with (stored as XML and exposed via a custom modelling tool) and from the model generate tables and views to join the base tables with the user-defined data tables. So each type would have a base table with core data and a user table with user-defined fields.

Take a document as an example: typical fields would be name, type, date, author, etc. This would go in the core table. Then users would define their own special document types with their own fields, such as contract_end_date, renewal_clause, blah blah blah. For that user-defined document there would be the core document table and the xcontract table, joined on a common primary key (so the xcontract's primary key is also a foreign key to the primary key of the core table). Then I would generate a view to wrap these two tables. Performance when querying was fast. Additional business rules can also be embedded into the views. This worked really well for me.
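
A rough sketch of that shape with invented names (SQL Server syntax; CREATE VIEW must run in its own batch, hence the GO separator):

CREATE TABLE Document (
    DocumentId int NOT NULL PRIMARY KEY,
    Name       varchar(200) NOT NULL,
    DocType    varchar(50)  NOT NULL,
    Author     varchar(100) NULL,
    CreatedOn  date         NULL
);

-- User-defined document type: shares its primary key with the core table
CREATE TABLE xContract (
    DocumentId      int NOT NULL PRIMARY KEY
        REFERENCES Document (DocumentId),
    ContractEndDate date NULL,
    RenewalClause   varchar(max) NULL
);
GO

-- Generated view wrapping the core fields plus the user-defined fields
CREATE VIEW vContract
AS
SELECT d.DocumentId, d.Name, d.Author, d.CreatedOn,
       x.ContractEndDate, x.RenewalClause
FROM Document AS d
JOIN xContract AS x ON x.DocumentId = d.DocumentId;
GO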

Kell
  • 3,252
  • 20
  • 19
0

In the comments I saw you saying that the UDF fields are there to dump imported data that is not properly mapped by the user.

Perhaps another option is to track the number of UDFs made by each user and force them to reuse fields by saying they can use 6 (or some other equally random limit) custom fields tops.

When you are faced with a database structuring problem like this it is often best to go back to the basic design of the application (import system in your case) and put a few more restraints on it.

Now what I would do is option 4 (EDIT) with the addition of a link to users:

general_data_table
id
...


udfs_linked_table
id
general_data_id
udf_id


udfs_table
id
name
type
owner_id --> Use this to filter for the current user and limit their UDFs
string_link_id --> link table for string fields
int_link_id
type_link_id

Now make sure to create views to optimize performance and get your indexes right. This level of normalization makes the DB footprint smaller, but your application more complex.

Wouter Simons
  • 2,856
  • 1
  • 19
  • 15
0

I would recommend #4 since this type of system is used in Magento, which is a well-established e-commerce CMS platform. Use a single table to define your custom fields using fieldId & label columns. Then, have separate tables for each data type, and within each of those tables have an index on fieldId and the value column. Then, in your queries, use something like:

SELECT *
FROM FieldValues_Text
WHERE fieldId IN (
    SELECT fieldId FROM Fields WHERE userId=@userId
)
AND value LIKE '%' + @search + '%'

This will ensure the best possible performance for user-defined types in my opinion.

In my experience, I've worked on several Magento websites that serve millions of users per month and host thousands of products with custom product attributes, and the database handles the workload easily, even for reporting.

For reporting, you can use PIVOT to convert your Fields table label values into column names, then pivot your query results from each data type table into those pivoted columns.
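
For instance, a hedged sketch of that PIVOT step, assuming the tables above also expose an entityId column on the values table and that the pivoted labels ([Color], [Size]) are known in advance:

SELECT EntityId, [Color], [Size]
FROM (
    SELECT v.entityId AS EntityId, f.label, v.value
    FROM FieldValues_Text AS v
    JOIN Fields AS f ON f.fieldId = v.fieldId
) AS src
PIVOT (
    MAX(value) FOR label IN ([Color], [Size])
) AS p;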

Mark Entingh
  • 621
  • 6
  • 10