I have timestamped geolocation data, plus some other information about users, and I am looking for advice on designing the database. I guess a naive design would be:
CREATE TABLE users (          -- "user" is reserved in PostgreSQL, hence "users"
    user_id INT NOT NULL,
    name    VARCHAR(128) NOT NULL,
    gender  VARCHAR(128) NOT NULL,
    age     INT NOT NULL,
    time    TIMESTAMPTZ NOT NULL,
    xloc    FLOAT(4) NOT NULL,
    yloc    FLOAT(4) NOT NULL,
    PRIMARY KEY (user_id, time)  -- one row per (user, timestamp)
);
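For illustration, two consecutive measurements for one user would be stored as (all values made up):

-- name, gender, and age repeat verbatim on every row
INSERT INTO users VALUES
    (1, 'Alice', 'female', 29, '2024-01-01 10:00:00+00', 1.23, 4.56),
    (1, 'Alice', 'female', 29, '2024-01-01 10:05:00+00', 1.24, 4.57);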
Here xloc and yloc are floats indicating location. The obvious problem with this table is that the fields gender, age, and name are redundantly repeated for every timestamp. After reading the very comprehensive accepted answer to Storing time-series data, relational or non? I decided that a better solution would be to keep the geolocation data in a separate table, i.e. to have two tables:
CREATE TABLE geodata (
    user_id INT NOT NULL,
    time    TIMESTAMPTZ NOT NULL,
    xloc    FLOAT(4) NOT NULL,
    yloc    FLOAT(4) NOT NULL,
    PRIMARY KEY (user_id, time)
);
CREATE TABLE users (
    user_id INT NOT NULL,
    name    VARCHAR(128) NOT NULL,
    gender  VARCHAR(128) NOT NULL,
    age     INT NOT NULL,
    PRIMARY KEY (user_id)
);
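With this split, the original wide rows can still be reconstructed with a join when needed, e.g.:

-- Rebuild the original single-table view at query time.
SELECT u.name, u.gender, u.age, g.time, g.xloc, g.yloc
FROM users u
JOIN geodata g USING (user_id);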
Note that in the geodata table I am using both user_id and time as the PK, to try to conform to Sixth Normal Form (6NF) as suggested by the answer in the above link -- this presumably enables higher performance. Strictly speaking, though, 6NF allows at most one non-key attribute per table, but in my case I have two (xloc and yloc). PostgreSQL also supports array types, so another option would be:
CREATE TABLE geodata (
    user_id INT NOT NULL,
    time    TIMESTAMPTZ NOT NULL,
    loc     FLOAT(4) ARRAY[2] NOT NULL,
    PRIMARY KEY (user_id, time)
);
In this case, the client has to know that the array holds the x and y locations, in that order, but this is not an issue for now. The table now technically has just one non-key attribute, but I am more interested in its performance. I'm new to Postgres and to DBs in general. Would using array types be better in terms of performance?
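For concreteness, reading the array variant back into columns would look something like this (PostgreSQL arrays are 1-based; user 42 is just a made-up example):

SELECT time,
       loc[1] AS xloc,  -- x comes first by the convention above
       loc[2] AS yloc
FROM geodata
WHERE user_id = 42;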
Data and use case: The location time series for each user can be tens of millions of measurements long, at varying intervals. Read ops will outnumber write ops -- in fact, right now my data is static, and the resulting database will be used by a small team for statistical analyses, at least for now. My queries will be along the lines of "all measurements for male users" or "Sunday measurements for users under 30"; sketches of both are below.
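Against the two-table design I picture those queries roughly as follows (assuming gender holds plain strings like 'male'):

-- All measurements for male users.
SELECT g.*
FROM geodata g
JOIN users u USING (user_id)
WHERE u.gender = 'male';

-- Sunday measurements for users under 30. EXTRACT(DOW ...) returns 0
-- for Sunday; with TIMESTAMPTZ the result depends on the session's
-- TimeZone setting.
SELECT g.*
FROM geodata g
JOIN users u USING (user_id)
WHERE u.age < 30
  AND EXTRACT(DOW FROM g.time) = 0;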
What alternative designs would you recommend?