
I am about to design a database for which I will have frames from multiple data sources arriving, at different times. Each of these frames will need to be dissected and converted into engineering units for storage into the database with an appropriate data type (e.g. FLOAT, INT). Accordingly, the general structure would be a DATETIME datatype for a primary key and the remaining columns would be each of these converted values. This would keep everything aligned to the proper time of receipt.

For most of my tables (representing each data source/frame) the column counts are relatively small, several hundred columns. However, there will be one that looks to have approximately 10,000 columns. From previous experience, I know that there are hard limits on both the maximum number of columns (4,096) and the complete row size (65,535 bytes).

Previously, I would have split this up into "adjacent" tables, like tableA_000, tableA_001, etc., with each holding some arbitrary number of columns yet all sharing the same primary key (time); a rough sketch of that layout is below. But doing this becomes a bit messy, namely that I have to create a separate lookup table to consult first when I want to find something. (For example, "colVoltage is in tableA_007", "colTimer is in tableB_003".) I can certainly do this again in this manner, but I'm curious ... does anybody else have any different strategies for handling this? In an ideal world, I would be able to both insert into and select from what (to me) looks like one big table despite having so many columns.
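To make that concrete, here is a rough sketch of the split-table layout I mean (all table and column names are just placeholders):

CREATE TABLE tableA_000 (
  t          DATETIME NOT NULL PRIMARY KEY,  -- time of receipt, shared across all sub-tables
  colVoltage FLOAT,
  colCurrent FLOAT
);

CREATE TABLE tableA_001 (
  t        DATETIME NOT NULL PRIMARY KEY,
  colTimer INT
);

-- lookup table mapping each converted value to the sub-table that holds it
CREATE TABLE column_lookup (
  col_name   VARCHAR(64) NOT NULL PRIMARY KEY,
  table_name VARCHAR(64) NOT NULL
);

-- pulling values that ended up in different sub-tables means joining on the time key
SELECT a.t, a.colVoltage, b.colTimer
FROM tableA_000 a
JOIN tableA_001 b USING (t)
WHERE a.t BETWEEN '2023-06-01' AND '2023-06-08';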

MariaDB 10.11.3 / CentOS 8

  • This sounds like you shouldn't be using multiple columns at all; an attribute-value table would be more appropriate. – Barmar Jun 08 '23 at 15:30
  • *approximately 10,000 columns* I cannot imagine 10000 **unique** properties. – Akina Jun 08 '23 at 15:54
  • re 10,000 columns ... imagine a frame/packet/transmission of maybe 250kB in size, chock-full of embedded data. This is what I am extracting and stuffing into a single row in the database. – user3120127 Jun 08 '23 at 16:09
  • re attribute-value table, I am unfamiliar ... I've got some learning to do, thanks! – user3120127 Jun 08 '23 at 16:11
  • Well... and what do you do with these column values? Save-store-retrieve? Or do you use them for filtering, calculating sum/min/average and so on? – Akina Jun 08 '23 at 18:12
  • Most typical usage is - after insertion - retrieval for charting or exporting to file. This could include reduced/summarized (like average) data. – user3120127 Jun 08 '23 at 19:27
  • It is a bit difficult to recommend alternative data structures if we do not know what the data looks like. You are getting the common answers (EAV, blob column with intrinsic structure), but in reality, we are just guessing. I would venture that a relational database may not even be a good fit for such data. – Shadow Jun 09 '23 at 09:48

1 Answer


Personally I think this sounds like a use case for some kind of document store. Creating tables with so many columns is likely to become unmanageable quite quickly. You can do this in MySQL using the JSON data type, which gives you a very flexible data structure within a single table. You can also use generated columns to create fast indexes on the data elements that are commonly queried.

I'd also question why you're using a DATETIME as a primary key; surely there's a potential for clashes? This would be my create statement:

CREATE TABLE frames (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  type VARCHAR(50) NOT NULL,   -- which data source/frame this row came from
  data JSON NOT NULL,          -- the full decoded frame as a JSON document
  -- generated column materialises one commonly queried value so it can be indexed
  example_key INT GENERATED ALWAYS AS (data->'$.example_key'),
  INDEX(created),
  INDEX(type),
  INDEX(example_key)
);
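
To make the shape of the data concrete, an insert might look something like this (the field names inside the JSON are just illustrative, not from the question):

INSERT INTO frames (created, type, data)
VALUES (
  '2023-06-08 15:30:00',
  'tableA',
  JSON_OBJECT('example_key', 42, 'colVoltage', 3.3, 'colTimer', 17)
);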

The MySQL docs are good on this, at https://dev.mysql.com/doc/refman/8.0/en/json.html and https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html#json-column-indirect-index

Obviously there are other database engines that are designed exclusively to function in this way, which might be a better choice if this is the only function of your app and it doesn't need a relational DB for other bits. Also bear in mind that the JSON type can be a bit slow, because if you're SELECTing you end up pulling out the entire contents of that field for each record returned. Again, generated columns can be good for this.
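
As a sketch of what that looks like in practice (assuming the frames table above; '$.colVoltage' and the type value are just example names), charting one value over a time range can stay narrow by selecting only the generated column, or extracting just the one path, rather than returning the whole JSON document:

-- via the indexed generated column
SELECT created, example_key
FROM frames
WHERE type = 'tableA'
  AND created BETWEEN '2023-06-01' AND '2023-06-08'
ORDER BY created;

-- via JSON_EXTRACT for a value that has no generated column
SELECT created, JSON_EXTRACT(data, '$.colVoltage') AS colVoltage
FROM frames
WHERE created BETWEEN '2023-06-01' AND '2023-06-08'
ORDER BY created;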

james-geldart
  • Thanks for the suggestions, great approach to think about. What other database engines would you recommend? I've admittedly had my head down (and been attached to MySQL) for so long! – user3120127 Jun 08 '23 at 15:55
  • The whole-row approach does give me a bit of pause, as you mentioned, in that I'm frequently fetching (what would have been previously) a single column from the dataset, over a time period for things like charting. That could result in something like 5k rows being returned (with all data in this approach), just to then extract a small sliver from each. – user3120127 Jun 08 '23 at 16:01
  • I'm also a bit MySQL or nothing, but it's probably worth looking at things like MongoDB, CouchDB and ElasticSearch. Also if you use AWS then DynamoDB might be worth a look – james-geldart Jun 09 '23 at 14:02