I would like to store a large amount of data in a relational database (PostgreSQL on RDS in my case), but I am not sure what structure to use. I followed a database course that covered database normalization, but I am still not sure how to structure the database.

My data consists of tabular results of a climate model, aggregated by zones (water basins).

parameter in [temperature, precipitation, .. etc.]  (length = 11)
year in [1960 : 2014]
month in [1 : 12]
zone (basin) in [1 : 16000]

Option 1

My first option is to create a separate table for each indicator and store the data as follows:

|  ID   | basin_id | temperature | unit | year | month | temporal_resolution |
|-------|----------|-------------|------|------|-------|---------------------|
|     1 |        1 |        42.1 | K    | 2000 |     1 | month               |
|     2 |        2 |        1.87 | K    | 2000 |     1 | month               |
|    .. |       .. |          .. | ..   |   .. |    .. | ..                  |
| 11001 |        1 |        40.3 | K    | 2000 |     2 | month               |
| 11002 |        2 |         2.3 | K    | 2000 |     2 | month               |
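
In PostgreSQL, I imagine this option looking roughly like the following (the table name, column names, and types are just a sketch, not a final schema):

-- one table like this per indicator
create table temperature (
  id                  bigserial primary key,
  basin_id            integer  not null,
  temperature         real     not null,
  unit                text     not null,
  year                smallint not null,
  month               smallint not null,
  temporal_resolution text     not null,
  unique (basin_id, year, month)
);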

Option 2

The second option is a single vertical table:

|  ID   | basin_id |   indicator   | value | unit | year | month | temporal_resolution |
|-------|----------|---------------|-------|------|------|-------|---------------------|
|     1 |        1 | temperature   |  42.1 | K    | 2000 |     1 | month               |
|     2 |        2 | temperature   |  1.87 | K    | 2000 |     1 | month               |
|    .. |       .. | ..            |    .. | ..   |   .. |    .. | ..                  |
| 11001 |        1 | precipitation |  40.3 | m3   | 2000 |     2 | month               |
| 11002 |        2 | precipitation |   2.3 | m3   | 2000 |     2 | month               |
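
A rough PostgreSQL sketch of this vertical table (again, names and types are illustrative only):

create table fact_table (
  id                  bigserial primary key,
  basin_id            integer  not null,
  indicator           text     not null,  -- e.g. 'temperature'
  value               real     not null,
  unit                text     not null,
  year                smallint not null,
  month               smallint not null,
  temporal_resolution text     not null,
  unique (basin_id, indicator, year, month)
);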

My question is whether splitting up the indicator name and value like this is recommended, or whether it should be avoided. If the data is stored vertically, the total number of rows will be approx. 16,000 * 11 * 12 * 55 = 116,160,000, which I am not sure is very manageable.

Option 3

EDIT: since the number of indicators is limited (a dozen or so), a vertical table structure is not strictly required. A third option is to merge the different indicator tables into something like:

|  ID   | basin_id | temperature_k | precipitation_m | .. | riverdischarge_m3 | year | month | temporal_resolution |
|-------|----------|---------------|-----------------|----|-------------------|------|-------|---------------------|
|     1 |        1 |          42.1 |            42.1 | .. |              42.1 | 2000 |     1 | month               |
|     2 |        2 |          42.1 |            42.1 | .. |              42.1 | 2000 |     1 | month               |
|    .. |       .. |            .. |              .. | .. |                .. |   .. |    .. | ..                  |
| 11001 |        1 |          42.1 |            42.1 | .. |              42.1 | 2000 |     2 | month               |
| 11002 |        2 |          42.1 |            42.1 | .. |              42.1 | 2000 |     2 | month               |

This results in a row count of 16,000 * 55 * 12 = 10,560,000.
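
A rough PostgreSQL sketch of this merged option (column names are illustrative; there would be one column per indicator):

create table climate_facts (
  id                  bigserial primary key,
  basin_id            integer  not null,
  temperature_k       real,
  precipitation_m     real,
  -- ... one column for each of the remaining indicators
  riverdischarge_m3   real,
  year                smallint not null,
  month               smallint not null,
  temporal_resolution text     not null,
  unique (basin_id, year, month)
);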

Rutger Hofste

  • What is the reason for splitting year & month? What is the reason for an EAV-like `indicator` column? (BTW: `unit` probably depends on it) – wildplasser May 23 '18 at 10:47
  • I could combine year + month in a date column and include the unit in the indicator column. Units depend on the aggregation method and can be m3 or m. The reason for an EAV-like indicator column is that a coworker recommended this option, but I can't figure out whether it improves the database. – Rutger Hofste May 23 '18 at 11:12
  • You could even store different `indicator`s in separate tables... Anything goes! But basically, your *fact table* needs three key dimensions: {what, where, when} – wildplasser May 23 '18 at 11:29

2 Answers

This appears to be an example of modeling inheritance in a relational database.

You have an abstract entity - "observation" - with properties zone, year, and month, and concrete entities: a "temperature_observation" with a "temperature" attribute, a "precipitation" observation with a volume in cubic metres, and so on.

This SO question outlines the options available - none of them are particularly clean. Your option 1 is "table per subclass".

Your option 2 is not one of the common solutions; it's available to you because your data apparently has just 2 attributes - quantity (numeric) and unit of measurement. That's not a common situation.

Option 3 is "single table inheritance". It's a common design pattern and generally works if you have a limited number of subclasses; once you get lots of subclasses, it becomes rather hard to understand.

The next thing you need to consider is how you will query this data. Is it a question of "return all records for a given period/basin"? In that case, it doesn't much matter - any of your options would be fine.

If you want to use the database for more complex queries, such as:

  • in which month was the temperature highest and precipitation lowest?
  • what's the average temperature in basins where precipitation is at least x?

then you may want to favour the "readability" of your queries.

In my opinion, option 1 is pretty clear - anyone looking at your database queries will understand, at a glance, what you're asking. You will be joining on the things that define the problem domain - basin, year, month.

Options 2 and 3 require self-joins, which may become rather complex and harder to read as the conditions become more complex.

For instance, the query "what's the average temperature in basins where precipitation is at least x?" in option 1 is:

select avg(temperature)
from temperature_facts
where basin_id in 
  (select basin_id
   from precipitation_facts
   where precipitation > ?)

In option 2, this becomes:

select avg(value)
from   fact_table
where  indicator = 'temperature'
and    basin_id in 
  (select basin_id
   from fact_table
   where value > ?
   and indicator = 'precipitation')

In option 3, it's something like:

select avg(temperature)
from fact_table
where basin_id in 
  (select basin_id
   from fact_table
   where precipitation > ?)

Personally, I find option 1 more expressive, but that's a question of preference.

Neville Kuyt

Since we too do marine research, I understand your needs (maybe this is data coming from a CTD). In my experience:

  • Instead of a single date/datetime field, separate year, month, day, and hour fields proved valuable (the resolution hardly ever got finer than an hour).
  • For each probe set used, a different set of data came in as a delimited text file (with header lines at the top). For importing it and later doing validations and transformations, using a field per parameter made sense.
  • OTOH, there were cases where a model needed a lot of data of the same type, and I observed that non-traditional ways to store the data proved much faster. E.g.: within a date range (say the last 5 years), a certain parameter is measured at a fixed period of one second. That is 86,400 measurements of that parameter per day. Multiply it by 5 years and you end up with nearly 158 million values, and that is for a single location. If you need to store that for 20 locations, you need over 3 billion rows of data if you do it the traditional way. If it didn't affect the storage too, I would say this is just a problem for the ETL and the analysis afterwards. With the traditional approach you would do this:

    LocationID (int), Datetime (which would be incrementing per second), Value

    it looks innocent, but just the storage would be more than 30 GB. So instead I took another approach and stored:

    LocationId, date (not datetime), oneDayWorthValues (as binary, fixed per second; if the value range fits in 2 bytes, then each row has exactly 86400*2 bytes of binary data) - the database I used back then didn't have an array datatype.
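
    For what it's worth, PostgreSQL (which you are on) does have an array datatype, so a rough equivalent of that trick could look like this - the table and column names and the `real[]` element type are just my sketch:

    -- one row per location per day; 86400 per-second readings packed
    -- into an array instead of 86400 individual rows
    create table per_second_readings (
        location_id integer not null,
        reading_day date    not null,
        readings    real[]  not null,  -- 86400 entries, one per second
        primary key (location_id, reading_day)
    );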

Just my 0.05 cents, but maybe worthwhile if you would also do something similar (rare with water data).

Cetin Basoz