When designing a new relational database, each object type is normally represented by a corresponding table. What is the best practice for designing a database that stores a huge number of DIFFERENT object types while avoiding the creation and maintenance of thousands of database tables? What better alternatives to a relational database exist for this case?
- It's a common misconception about relational databases that tables represent object types; that idea actually came from the network data model. In proper relational databases, each table represents a fact type that can involve any number of object types in different roles. – reaanb Feb 20 '17 at 14:54
- If someone were to quote a source and claim it to be credible or official in answering this question, I would be inclined to dismiss that source as NOT credible, simply because a system with 10K+ object types is an anti-pattern. There are no best practices other than to find a way not to have so many object types to maintain. Some of the answers have suggestions for accomplishing that. – Douglas Daseeco Feb 27 '17 at 03:37
6 Answers
The answer depends largely on the nature of the distinctions between the thousands of object types, and on the degree to which, and the ways in which, they can be classified and possibly generalized further. Discovery is the key to a maintainable design in scenarios such as this.
Here are some potential persistence options that may work for your set of object types. It will take some thought to consider the pros and cons of each.
- Discover a hidden structure or pattern in the object types allowing them to be decomposed [1][2][3].
- Discover categories of object types to which (1) can then be applied.
- Map multiple object types to a single table or a smaller set of tables or document types.
- Map the objects one to one and determine a meta scheme to maintain them affordably.
Whether the database is relational or not, how it is structured, what type of search features are available, and how keys are implemented are decisions that should be made after the discovery above. That is the best practice.
Determining the structure of the data in such a way that storage, maintenance, and retrieval have the desired characteristics cannot be covered adequately even in a 500-page book, let alone in a short answer.
Learning the pros and cons of these potential choices would be a good start. You can web search these persistence philosophies by their names and the words "database" or "persistence" to see descriptions and vendor products that correspond.
- Relational table
- Relational object
- Tabular non-relational
- Mapping (key and value)
- Mapping (key and fixed record payload)
- Document (free text)
- Hierarchical
- Graph (network of edges connecting vertices)
- Multidimensional (OLAP and others)
You may find that the reason you have thousands of data types is that they correspond to document types, and the only thing they have in common is the human language they are written in, or possibly not even that. Perhaps they are in arbitrary locales, in which case internationalized document storage systems are the first options to examine.
You may find that there is a set of semantic rules that 9,800 of your 10,000+ object types conform to, in which case characterizing and specifying those rules may lead to a more granular storage scheme [4][5][6]. Formalizing the semantic structure in conjunction with a structural software design effort (such as the composite or decorator pattern, sketched below) may permit a gross reduction in the number of object types.
Such refactoring can easily be worth the time and may get your project up to speed in a fraction of the time.
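As a minimal illustration of how such a structural pattern can collapse a combinatorial explosion of types, here is a hypothetical Java sketch of the decorator approach; every name in it is illustrative, not drawn from the question:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: names and structure are illustrative.
interface DataRecord {
    Map<String, Object> attributes();
}

class BaseRecord implements DataRecord {
    private final Map<String, Object> attrs = new HashMap<>();
    @Override
    public Map<String, Object> attributes() { return attrs; }
}

// Each former "object type" becomes a small decorator adding one facet,
// so N x M type combinations reduce to roughly N + M classes.
abstract class RecordDecorator implements DataRecord {
    protected final DataRecord inner;
    protected RecordDecorator(DataRecord inner) { this.inner = inner; }
    @Override
    public Map<String, Object> attributes() { return inner.attributes(); }
}

class TimestampedRecord extends RecordDecorator {
    TimestampedRecord(DataRecord inner) {
        super(inner);
        attributes().put("createdAt", java.time.Instant.now().toString());
    }
}

public class DecoratorSketch {
    public static void main(String[] args) {
        DataRecord r = new TimestampedRecord(new BaseRecord());
        System.out.println(r.attributes()); // e.g. {createdAt=2017-02-27T03:37:00Z}
    }
}
```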
Upon the discovery of additional structure, you then will need to determine what level of normalization makes sense for your store, update, retrieval, and disk footprint requirements.
Literature (all over the web) on normalization and denormalization will help you understand the trade-offs between space, speed of writing, and speed of reading [7][8][9]. If a large amount of data is stored each day, the ETL characteristics will also play into the design significantly.
The selection of vendor and product is probably the last thing you will do architecturally, before low-level design, implementation, and test framework construction begin. (That is another challenge with so many data types: how will you adequately test 10,000+ classes?)
Giving you narrower recommendations than this would be irresponsible without more characterization of the thousands of object types and why there are so many.
References
[1] https://www.tutorialspoint.com/design_pattern/design_pattern_quick_guide.htm
[2] https://sourcemaking.com/design-patterns-and-tips
[3] https://sourcemaking.com/design_patterns/strategy
[4] https://www.cs.cmu.edu/~dunja/LinkKDD2004/Jure-Leskovec-LinkKDD-2004.pdf
[5] https://archive.org/details/Learning_Structure_and_Schemas_from_Documents
[7] http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
[9] https://fenix.tecnico.ulisboa.pt/downloadFile/3779571831168/SchemaTuning.ppt

Use a NoSQL database (Lucene, MongoDB, Cassandra, Solr, Elasticsearch, Hadoop, etc.), which stores documents that can have any number of fields (think maps of key/value pairs). In relational database terms, it's as if each "row" could have a different row definition. I have implemented exactly this in the past, and I found it handy to store a class field so I could reconstruct the correct type of object (in my case Java, but this would apply to any language).
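For illustration, here is a minimal sketch of that class-field idea, assuming Jackson for JSON binding; the envelope layout and the DocumentCodec name are hypothetical, not part of the original implementation:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

// Hypothetical sketch: the "class" + "data" envelope is an assumption,
// and Jackson is just one possible JSON binder.
public class DocumentCodec {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Write: record the concrete class name alongside the payload.
    static String toJson(Object obj) throws Exception {
        return MAPPER.writeValueAsString(
            Map.of("class", obj.getClass().getName(), "data", obj));
    }

    // Read: look at the "class" field first, then bind "data" to that type.
    static Object fromJson(String json) throws Exception {
        JsonNode tree = MAPPER.readTree(json);
        Class<?> type = Class.forName(tree.get("class").asText());
        return MAPPER.treeToValue(tree.get("data"), type);
    }
}
```

Note that storing the fully qualified class name couples the stored data to the codebase; a stable type alias may age better across refactorings.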
You could also use a relational database that supports a JSON column type (e.g., Postgres) and serialise/deserialise your objects to/from JSON, storing them in a JSON-typed column. To make a convenient one-table solution, you would probably want a column that stores the object type, to make deserialisation simple. I have also implemented this option and it worked for me.
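A hedged sketch of that one-table JSONB layout via plain JDBC follows; the table name, columns, and connection details are all assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Hypothetical sketch: table, columns, and connection details are assumptions.
public class JsonbStore {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/appdb", "app", "secret")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS objects ("
                        + " id BIGSERIAL PRIMARY KEY,"
                        + " object_type TEXT NOT NULL,"   // drives deserialisation
                        + " payload JSONB NOT NULL)");
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO objects (object_type, payload) VALUES (?, ?::jsonb)")) {
                ps.setString(1, "com.example.Invoice");   // hypothetical type name
                ps.setString(2, "{\"number\": 42, \"total\": 19.95}");
                ps.executeUpdate();
            }
        }
    }
}
```

The object_type column plays the same role as the class field above: it tells the reader which type to deserialise into.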
Both options are good. The first one is a better technology fit. The second one can be less mysterious if you're already familiar with RDBMSs.
What you don't want to do is use any ORDBMS solution, where each object type has a dedicated table with columns matching the class fields. It's painfully rigid if you ever change the definition of your class, and totally unscalable if the number of different classes grows past even a very small number.

"Best practice" is subjective, and often used as a way of presenting personal preferences as being in some way authoritative.
So, here's my personal preference...
You have to do the analysis work. Is your data relational - can you say that there are entities and relationships? If so - create a relational schema. You may have to deal with inheritance relationships - this is something the traditional relational model doesn't deal with particularly well, but there are a number of possible solutions.
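For the inheritance point, one commonly cited relational workaround is class table inheritance: a supertype table for shared columns, with subtype tables sharing its primary key. A minimal sketch, with all table names and connection details assumed:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch of "class table inheritance": the supertype table holds
// shared columns; each subtype table shares its primary key with the parent.
public class InheritanceSchema {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/appdb", "app", "secret");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE party ("
                    + " id BIGSERIAL PRIMARY KEY,"
                    + " name TEXT NOT NULL)");
            st.execute("CREATE TABLE person ("
                    + " id BIGINT PRIMARY KEY REFERENCES party(id),"
                    + " birth_date DATE)");
            st.execute("CREATE TABLE organisation ("
                    + " id BIGINT PRIMARY KEY REFERENCES party(id),"
                    + " registration_no TEXT)");
        }
    }
}
```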
Are the objects you discuss not really relational? Do they have varying attributes, or are they made up mostly of unstructured data? Are the relationships primarily hierarchical? Are you really talking about time-series data, or geographical objects? In this case, you may be better served by one of the many NoSQL solutions.
Is the data "read-write", or "read-only"? Are you building a big data repository for reporting and analysis? If so, you may want to use an OLAP/BI database solution, rather than a relational schema.
Do you have extreme scalability or performance requirements? If so, where: read, write, or analysis? You may need to consider a highly denormalized data model.

I'm sure that when you say 10,000+ object types, you mean something beyond primitive types like int and float, and even beyond complex well-known types such as graphs.
You can't use a relational database here because, for example, storing even a simple graph would require designing custom relations and tables. So the only option left is a key-value NoSQL database, where any object type can be serialized to a document and stored under its object id.
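As a rough illustration of that serialize-and-store idea applied to a graph, here is a toy sketch in which an in-memory map stands in for the actual key-value store; the key format is invented for this example:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy: the in-memory map stands in for a real key-value store,
// and the "graph:<id>:<vertex>" key format is made up for illustration.
public class GraphToKeyValue {
    public static void main(String[] args) {
        Map<String, String> kvStore = new HashMap<>();

        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c"),
                "c", List.of());

        // One entry per vertex: the value is the serialized adjacency list.
        graph.forEach((vertex, neighbours) ->
                kvStore.put("graph:g1:" + vertex, String.join(",", neighbours)));

        // Reading a vertex back is a single key lookup, no joins required.
        System.out.println(kvStore.get("graph:g1:a")); // prints "b,c"
    }
}
```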

An alternative you may consider, regardless of the type of database, is to store your data as a JSON string. This way the stored data can be as dynamic as needed and can be changed freely. The drawback is that you are limited to server- and client-side JSON handlers, which must do all the "heavy lifting" of querying, parsing, and otherwise relating the data.
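A small sketch of that drawback, assuming Jackson: once objects live in opaque JSON strings, filtering has to happen in application code rather than in the database:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

// Hypothetical sketch: the rows pretend to come from a plain TEXT column,
// so any filtering must happen here rather than in the database.
public class ClientSideFilter {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        List<String> rows = List.of(
                "{\"type\":\"sensor\",\"value\":7}",
                "{\"type\":\"sensor\",\"value\":42}");

        for (String row : rows) {
            JsonNode doc = mapper.readTree(row);
            if (doc.get("value").asInt() > 10) {  // the "heavy lifting" the DB would normally do
                System.out.println(doc);
            }
        }
    }
}
```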
As others have said, NoSQL databases sound like what you are looking for in terms of avoiding relational databases' structural requirements.

The distinction should be made between object types, object features, object attributes, and object instances.
No system should ever have 10,000+ object types. The maintenance of such a body of source code would be horrendous. Instead, determine how to have between 10 and 100 object types and use features and attributes to model those things that differ.
Even if you start with an entity-relationship diagram or design the schema first (designing from the back end forward), you should limit the number of data types to around 100 and provide normalized or denormalized schemas to represent the attributes, features, and relationships among your decomposed objects.
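One hedged way to realize this is an entity-attribute-value style schema, where a handful of fixed tables absorb per-type variation as attribute rows; all names and connection details below are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch of an entity-attribute-value (EAV) style schema: a small
// fixed set of tables, with per-object variation pushed into attribute rows.
public class AttributeSchema {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/appdb", "app", "secret");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE object_type ("
                    + " id SERIAL PRIMARY KEY,"
                    + " name TEXT UNIQUE NOT NULL)");        // tens of rows, not 10,000
            st.execute("CREATE TABLE object ("
                    + " id BIGSERIAL PRIMARY KEY,"
                    + " type_id INT NOT NULL REFERENCES object_type(id))");
            st.execute("CREATE TABLE object_attribute ("
                    + " object_id BIGINT NOT NULL REFERENCES object(id),"
                    + " name TEXT NOT NULL,"
                    + " value TEXT,"
                    + " PRIMARY KEY (object_id, name))");
        }
    }
}
```

EAV trades query simplicity and type safety for flexibility, which is exactly the trade-off the discovery work described above should weigh.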
You may wish to take a look at software design patterns to get some ideas.
