3

I have a data set with hundreds of parameters (and more coming in).

  1. If I dump them all into one table, it will probably end up with hundreds of columns (and I am not even sure how many, at this point).
  2. I could go row-based, with a bunch of meta tables, but a row-based structure somehow feels unintuitive.
  3. Another option is to stay column-based but split the data logically across multiple tables, which seems like a good solution.

Is there any other way to do it? If so, could you point me to a tutorial? (I am using MySQL.)

EDIT: Based on the answers, I should clarify one thing: updates and deletes will be far fewer than inserts and selects. Selects will make up the bulk of the operations, so they have to be fast.

  • It sounds like #2 is the best option, so that you can join your tables with SQL statements. Can you post an example of your proposed schema (or just the fields you plan on using)? –  Jan 26 '12 at 14:37
  • [Joins are not as expensive as a lot of people seem to think.](http://stackoverflow.com/questions/173726/when-and-why-are-database-joins-expensive) – spraff Feb 01 '12 at 10:08

3 Answers

2

I've run across several designs where a #4 was possible:

  • Split your columns into searchable and auxiliary
  • Define a table with only searchable columns, and an extra BLOB column
  • Put everything in one table: searchable columns go in as-is, auxiliary columns are serialized into the BLOB

We used this approach with BLOBs of XML data or even binary data representing the entire serialized object. The downside is that your auxiliary columns remain non-searchable for all practical purposes. The upside is that you can add new auxiliary columns at will without changing the schema. You can also promote a previously auxiliary column to a searchable one later with a schema change and a very simple migration program.
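A minimal sketch of this layout, assuming a hypothetical `items` table; the searchable column names and the `aux_data` column are purely illustrative:

```sql
-- Hypothetical example: a few indexed, searchable columns plus one BLOB
-- that holds the serialized auxiliary attributes (XML, JSON, or binary).
CREATE TABLE items (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(255)    NOT NULL,
    category_id INT             NOT NULL,
    created_at  DATETIME        NOT NULL,
    aux_data    BLOB,                      -- everything else, serialized
    KEY idx_category (category_id),
    KEY idx_created  (created_at)
);

-- Selects filter only on the searchable columns; the BLOB is just carried along.
SELECT id, name, aux_data
FROM items
WHERE category_id = 42
  AND created_at >= '2012-01-01';
```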

Sergey Kalinichenko
1

It all depends on the kind of data you need to store.

If it's not "relational" at all - for instance, a collection of web pages, documents, etc - it's usually not a good fit for a relational database.

If it's relational, but highly variable in schema - e.g. a product catalogue - you have a number of options:

  • single table with every possible column (your option 1)
  • "common" table with the attributes that each type shares, and joined tables for attributes for subtypes
  • table per subtype
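A rough sketch of the "common table plus joined subtype tables" layout, using a hypothetical product catalogue; all table and column names are illustrative:

```sql
-- Common attributes shared by every product.
CREATE TABLE product (
    id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(255)  NOT NULL,
    price DECIMAL(10,2) NOT NULL
);

-- Attributes that only apply to one subtype, joined 1:1 on the product id.
CREATE TABLE product_book (
    product_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    isbn       CHAR(13) NOT NULL,
    page_count INT,
    FOREIGN KEY (product_id) REFERENCES product(id)
);

CREATE TABLE product_shirt (
    product_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    size       VARCHAR(8),
    colour     VARCHAR(32),
    FOREIGN KEY (product_id) REFERENCES product(id)
);

-- Reading one subtype is a single join against the common table.
SELECT p.id, p.name, p.price, b.isbn
FROM product p
JOIN product_book b ON b.product_id = p.id;
```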

If the data is highly variable and you don't want to make schema changes to accommodate the variations, you can use an "entity-attribute-value" (EAV) design, though this has some significant drawbacks in the context of a relational database. I think this is what you have in mind with option 2.
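A minimal EAV sketch, assuming a hypothetical entity/attribute/value split; the names and the two attribute ids in the query are illustrative:

```sql
-- Entity-attribute-value: each row stores one attribute value of one entity.
CREATE TABLE entity (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE attribute (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) NOT NULL UNIQUE
);

CREATE TABLE entity_attribute_value (
    entity_id    BIGINT UNSIGNED NOT NULL,
    attribute_id INT NOT NULL,
    value        VARCHAR(255),   -- every value is forced into one (string) type
    PRIMARY KEY (entity_id, attribute_id),
    FOREIGN KEY (entity_id)    REFERENCES entity(id),
    FOREIGN KEY (attribute_id) REFERENCES attribute(id)
);

-- One of the drawbacks: even fetching two attributes of one entity
-- needs a separate join per attribute.
SELECT e.name,
       colour.value AS colour,
       weight.value AS weight
FROM entity e
JOIN entity_attribute_value colour
  ON colour.entity_id = e.id AND colour.attribute_id = 1   -- 'colour'
JOIN entity_attribute_value weight
  ON weight.entity_id = e.id AND weight.attribute_id = 2   -- 'weight'
WHERE e.id = 10;
```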

If the data is indeed relational, and there is at least the core of a stable model in the data, you could of course use traditional database design techniques to come up with a schema. That seems to correspond to option 3.

Neville Kuyt
  • It is relational, not document-type data, so I am not sure whether MongoDB-like databases would work. –  Jan 26 '12 at 15:25
  • Is the schema predictable in advance, and fixed? Or does the schema change over time? – Neville Kuyt Jan 26 '12 at 16:09
0

Does every item in the data set have all those properties? If yes, then one big table might well be fine (although scary-looking).

On the other hand, perhaps you can group the properties. The idea being that if an item has one of the properties in the group, then it has all the properties in that group. If you can create such groupings, then these could be separate tables.

So should they be separate? Yes, unless you can prove that the cost of performing joins is unacceptable. Perform all SELECTs via stored procedures and you can denormalise later on without much trouble.
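One way to wrap a SELECT in a MySQL stored procedure, so callers stay insulated from any later denormalisation; the table and procedure names here are hypothetical:

```sql
DELIMITER //

-- Callers use this procedure instead of querying the tables directly,
-- so the underlying schema can change later without touching the callers.
CREATE PROCEDURE get_items_by_category(IN p_category_id INT)
BEGIN
    SELECT id, name, created_at
    FROM items
    WHERE category_id = p_category_id;
END //

DELIMITER ;

-- Usage:
CALL get_items_by_category(42);
```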

spraff
  • The thought of having one table with 500-1000 columns really scares me. If I have to run a query and check something from the command line, even that would be nearly impossible. Using stored procedures is a good idea; I'll look into it. The good thing is I'm starting from scratch, so there is some freedom in choosing the architecture. –  Jan 26 '12 at 15:23
  • It's not the size of the table that scares me, it's the size of the problem that the table is solving! Data design is more important than code design. Normalise your data by default, denormalise only when forced to. – spraff Jan 26 '12 at 15:28