1

I am performing large scale wind simulations to produce hourly wind patterns over a city. The results is a time series of 2-dimensional contours. Currently I am storing the results in SQLite3 database tables with the following structure

Table: CFD
id, timestamp,           velocity, cell_id
1 , 2010-01-01 08:00:00, 3.345,    1 
2 , 2010-01-01 08:00:00, 2.355,    2 
3 , 2010-01-01 08:00:00, 2.111,    3 
4 , 2010-01-01 08:00:00, 6.432,    4 
.., ..................., .....,    .
1000 , 2010-01-01 09:00:00, 3.345,    1 
1001 , 2010-01-01 10:00:00, 2.355,    2 
1002 , 2010-01-01 11:00:00, 2.111,    3 
1003 , 2010-01-01 12:00:00, 6.432,    4 
.., ..................., .....,    .

Actual create statement:

CREATE TABLE cfd(id INTEGER PRIMARY KEY, time DATETIME, u, cell_id integer)
CREATE INDEX idx_cell_id_cfd on cfd(cell_id)
CREATE INDEX idx_time_cfd on cfd(time)

(There are three of these tables, each for a different result variable)

where cell_id is a reference to the cell in the domain representing a location in the city. See this picture to have an idea of what it looks like at a specific timestep.contour at timestep

The typical query performs some kind of aggregation on the time dimension and group by on cell_id. For example, if I want to know the average local wind speed in each cell during a specific time interval, I would execute

select sum(time in ('2010-01-01 08:00:00','2010-01-01 13:00:00','2010-01-01 14:00:00', ...................., ,'2010-12-30 18:00:00','2010-12-30 19:00:00','2010-12-30 20:00:00','2010-12-30 21:00:00') and u > 5.0) from cfd group by cell_id

The number of timestamps can vary from 100 to 8,000.

This is fine for small databases, but it gets much slower for larger ones. For example, my last database was 60GB, 3 tables and each table had 222,000,000 rows.

Is there a better way to store the data? For example:

  • would it make sense to create a different table for each day?
  • would be better to use a separate table for the timesteps and then use a join?
  • is there a better way of indexing?

I have already adopted all the recommendations in this question to maximise the performance.

Rojj
  • 1,170
  • 1
  • 12
  • 32

1 Answers1

2

This particular query is hard to optimize because the sum() must be computed over all table rows. It is a better idea to filter rows with WHERE:

SELECT count(*)
FORM cfd
WHERE time IN (...)
  AND u > 5
GROUP BY cell_id;

If possible, use a simpler expression to filter times, such as time BETWEEN a AND b.

It might be worthwhile to use a covering index, or in this case, when all queries filter on the time, a clustered index (without additional indexes):

CREATE TABLE cfd (
    cell_id INTEGER,
    time DATETIME,
    u,
    PRIMARY KEY (cell_id, time)
) WITHOUT ROWID;
CL.
  • 173,858
  • 17
  • 217
  • 259
  • The problem with the first query is that it does not return all the `cell_id`, I wrote my query a while ago and that was the reason I used `sum` rather than `count`. The covering index helps quite a bit although I keep having apparently random execution times with differences from 2 to 3 times the minimum query time. Testing the `clustered index` now. – Rojj Oct 03 '17 at 07:03
  • Also, I cannot use `BETWEEN' as times are not necessarily consecutive. – Rojj Oct 03 '17 at 07:08
  • 1
    If you want to get all cells, then the `sum` approach is the best one. But this requires reading all rows from the table, which makes it even more important to have a covering or clustered index so that all the data can be read in the correct order withour seeking. – CL. Oct 03 '17 at 07:11