
I have experiment data to store. The experiment can be run with varying settings, say setting a and setting b, each of which can take several values. The experiment is run 100,000 times per setting combination. How should I optimize my table for this data structure?

The naïve implementation is a table with columns setting a, setting b and result, adding the data row by row. Setting a could for instance take the values (1, 2, 3) and setting b the values (0.1, 0.01):

(_ROWID_)  setting a  setting b  result
(0)        1          0.1        res_1
(1)        1          0.1        res_2
(2)        1          0.1        res_3
...
(n)        2          0.1        res_n
(n+1)      2          0.1        res_n+1
...
(k)        3          0.1        res_k
(k+1)      3          0.1        res_k+1
...
(l)        1          0.01       res_l
(l+1)      1          0.01       res_l+1
... etc.

The indices l > k > n > 0 are auto-incrementing row IDs used as the primary key. This is suboptimal but easy to implement, and I found data retrieval is slow. This implementation is neither good for memory (setting a and setting b are duplicated many times, so they should really be pointers into a shared set of values), nor good for search performance (the table is unordered, while searches should return ordered data).

Another option is a child table for each combination of setting a and setting b (named table_a_b) and a parent table pointing to each child table: a "table of contents". Then the result of the experiment can be stored without having to reference setting a and setting b in every row. This makes queries more difficult, however.

A third option is a clustered index, which forms a B-tree that does a similar job, if I understand correctly.

What are the benefits and drawbacks of the three options mentioned for the use case I described?

ljbkusters
  • With only this much information it is not possible to make a recommendation. There are different viewpoints based on what data you would acquire and how, and moreover SQLite might not be the right database for the job. At a quick thought, why SettingA, SettingB columns and not simply a single Setting column with another specifying the type? That would be the common-sense SQL structure (TestRun, Setting, Result). Depending on the real data, structures that don't look normal for an SQL table might be preferred (a typical example is sensor readings every second from N locations). – Cetin Basoz Nov 07 '22 at 11:01
  • Please ask 1 (specific researched non-duplicate) question. [ask] [Help] [research effort](https://meta.stackoverflow.com/q/261592/3404097) [mre] [Tips for asking a good SQL question](https://meta.stackoverflow.com/q/271055/3404097) [Asking query performance questions](https://dba.meta.stackexchange.com/q/3034/43932) [How to Optimize Queries in a Database - The Basics](https://stackoverflow.com/q/3191623/3404097) [What are your most common sql optimizations?](https://stackoverflow.com/q/1332778/3404097) [When and why are database joins expensive?](https://stackoverflow.com/q/173726/3404097) – philipxy Nov 09 '22 at 06:20

1 Answer


First of all, if your settings are really as straightforward as your example shows (a handful of numbers), the data design you have now is perfectly serviceable. And it has the great advantage of being very simple. Yes, you have some duplication of numbers. But your table is not going to overwhelm a modern machine's RAM and SSD/hard drive even with millions of rows. If you get to the point where looking stuff up in the table is too slow, you can add indexes. (Simple is good! Data interpretation errors stemming from overly complex data storage schemes are bad!)
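For instance, indexing your existing single-table design is a one-line change. A sketch using Python's built-in sqlite3 module (the table name `experiment` and a numeric `result` column are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (a INTEGER, b REAL, result REAL)")
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?)",
    [(1, 0.1, 0.5), (1, 0.1, 0.6), (2, 0.01, 0.7)],
)

# A composite index lets SQLite find all rows for a given (a, b)
# without scanning the whole table, and walks them in index order.
conn.execute("CREATE INDEX idx_experiment_a_b ON experiment (a, b)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT result FROM experiment WHERE a = 1 AND b = 0.1"
).fetchall()
print(plan)  # the plan's detail column mentions idx_experiment_a_b
```

This also addresses the ordering complaint: a query with `ORDER BY a, b` can be satisfied by walking the index rather than sorting.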

But, you are right that it's suboptimal. You are on the right track. SQL data design is about entities and relationships. You have, it seems to me, two entities in your physical world. One is setting and the other is result. The relationship between your entities is this: Zero or more results per setting.

So make yourself a settings table with one row per setting, containing columns for its attributes. Those are a and b in your example.

CREATE TABLE settings (
  settings_id  INTEGER PRIMARY KEY, 
  a            INTEGER NOT NULL,
  b            REAL NOT NULL,
  UNIQUE (a, b)
);

The UNIQUE line prevents you from creating duplicate rows in this table. That may be desirable, or may not. (Only you know that). If it's not desirable leave it out of your table definition.
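To see the constraint in action, here's a sketch with Python's built-in sqlite3 module: a second insert of the same (a, b) pair raises an error, and `INSERT OR IGNORE` skips duplicates silently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE settings (
    settings_id INTEGER PRIMARY KEY,
    a           INTEGER NOT NULL,
    b           REAL NOT NULL,
    UNIQUE (a, b)
)""")

conn.execute("INSERT INTO settings (a, b) VALUES (1, 0.1)")
try:
    # Violates UNIQUE (a, b): same pair again.
    conn.execute("INSERT INTO settings (a, b) VALUES (1, 0.1)")
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed

# INSERT OR IGNORE drops rows that would violate the constraint.
conn.execute("INSERT OR IGNORE INTO settings (a, b) VALUES (1, 0.1)")
count = conn.execute("SELECT COUNT(*) FROM settings").fetchone()[0]
print(count)  # still 1 row
```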

Then make yourself a results table like this. You'll have a column for each observation that forms part of the result. If I were you I'd add a result_time column so you have a way of knowing when you made your observations. The second column contains the settings_id of the settings.

CREATE TABLE results (
  results_id     INTEGER PRIMARY KEY, 
  settings_id    INTEGER REFERENCES settings (settings_id),
  results_time   DATETIME,
  obs1           REAL,
  obs2           TEXT,
  obs3           INTEGER
);

Then once these tables are populated you can run queries like this.

SELECT settings.settings_id, settings.a, settings.b,
       COUNT(results.results_id) AS number_of_runs,
       MIN(results.results_time) AS first_result_time,
       MAX(results.results_time) AS last_result_time,
       AVG(results.obs1) AS mean_obs1
  FROM settings
  JOIN results ON settings.settings_id = results.settings_id
 GROUP BY settings.settings_id, settings.a, settings.b
 ORDER BY settings.a, settings.b;

(Of course, you can run similar queries with your present data design.)

Populating these tables is the trick. As you insert each new results row, you'll have to use some code to ...

  • Look up the settings_id of the appropriate settings row.
  • If it doesn't exist, insert it and get its id with the last_insert_rowid() function.
  • Then use that id to insert your results row.
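Since the comments mention Python, the three steps above could look like this with the built-in sqlite3 module (a sketch; the single `obs1` observation column is a placeholder, and `Cursor.lastrowid` is Python's way of reading last_insert_rowid()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE settings (
    settings_id INTEGER PRIMARY KEY,
    a           INTEGER NOT NULL,
    b           REAL NOT NULL,
    UNIQUE (a, b)
);
CREATE TABLE results (
    results_id   INTEGER PRIMARY KEY,
    settings_id  INTEGER REFERENCES settings (settings_id),
    results_time DATETIME,
    obs1         REAL
);
""")

def insert_result(conn, a, b, obs1):
    """Look up (or create) the settings row, then insert the result."""
    cur = conn.cursor()
    row = cur.execute(
        "SELECT settings_id FROM settings WHERE a = ? AND b = ?", (a, b)
    ).fetchone()
    if row is None:
        cur.execute("INSERT INTO settings (a, b) VALUES (?, ?)", (a, b))
        settings_id = cur.lastrowid  # value of last_insert_rowid()
    else:
        settings_id = row[0]
    cur.execute(
        "INSERT INTO results (settings_id, results_time, obs1) "
        "VALUES (?, datetime('now'), ?)",
        (settings_id, obs1),
    )
    return settings_id

insert_result(conn, 1, 0.1, 0.5)
insert_result(conn, 1, 0.1, 0.6)   # reuses the existing settings row
insert_result(conn, 2, 0.01, 0.7)  # creates a new settings row
conn.commit()

n_settings = conn.execute("SELECT COUNT(*) FROM settings").fetchone()[0]
n_results = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(n_settings, n_results)  # 2 3
```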

You didn't mention which programming language you use (Python? Java?), so it's hard to give you specific advice about doing that.

O. Jones
  • Thank you for taking the time to write such an explicit answer. I will have a look at this later today to see if I can implement your suggestions. I'm using Python, but I'll manage on my own with the implementation part. – ljbkusters Nov 07 '22 at 13:20