
I have timestamped geolocation data and some other information about users and I am looking for advice for designing the database. I guess a naive design would be:

CREATE TABLE "user"(  -- quoted because user is a reserved word in PostgreSQL
    user_id INT NOT NULL,
    name VARCHAR(128) NOT NULL,
    gender VARCHAR(128) NOT NULL,
    age INT NOT NULL,
    time TIMESTAMPTZ NOT NULL,
    xloc FLOAT(4) NOT NULL,
    yloc FLOAT(4) NOT NULL,
    PRIMARY KEY (user_id, time)  -- one row per user per timestamp
);

Here xloc and yloc are floats indicating location. The obvious problem with this table is that the fields name, gender, and age will be redundantly repeated for each timestamp. After reading the very comprehensive accepted answer to Storing time-series data, relational or non?, I decided that a better solution would be to keep the geolocation data in a separate table, i.e. to have two tables:

CREATE TABLE geodata(
    user_id INT NOT NULL,
    time TIMESTAMPTZ NOT NULL,
    xloc FLOAT(4) NOT NULL,
    yloc FLOAT(4) NOT NULL,
    PRIMARY KEY (user_id, time)
);

CREATE TABLE "user"(  -- quoted because user is a reserved word in PostgreSQL
    user_id INT NOT NULL,
    name VARCHAR(128) NOT NULL,
    gender VARCHAR(128) NOT NULL,
    age INT NOT NULL,
    PRIMARY KEY (user_id)
);

Note that in the geodata table I am using both user_id and time as the PK to try to conform to Sixth Normal Form (6NF), as suggested by the answer in the above link -- this presumably enables higher performance. Strictly speaking, though, 6NF permits at most one other attribute besides the PK, but in my case I have two (xloc and yloc). PostgreSQL supports array types, so another option would be:

CREATE TABLE geodata(
    user_id INT NOT NULL,
    time TIMESTAMPTZ NOT NULL,
    loc FLOAT(4) ARRAY[2] NOT NULL,  -- note: PostgreSQL does not enforce the declared array size
    PRIMARY KEY (user_id, time)
);

In this case, the client must know that the array represents the x and y locations, in that order, but this is not an issue for now. The table now technically has just one non-key attribute, but I am more interested in its performance. I'm new to Postgres and to DBs in general. Would using array types be better in terms of performance?
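
For example (a sketch against the geodata definition above), queries would unpack the array by position:

-- PostgreSQL arrays are 1-based by default
SELECT user_id, time, loc[1] AS xloc, loc[2] AS yloc
FROM geodata;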

Data and use case: The location time series for each user can be tens of millions of measurements long, at varying intervals. Read ops will outnumber write ops -- in fact, right now my data is static, and the resulting database will be used by a small team for statistical analyses, at least for now. Typical queries would be, e.g., all measurements for male users, or Sunday measurements for users under 30 (sketched below).
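
For illustration, those queries might look like the following against the two-table design (the 'male' encoding is an assumption; gender is free text in the DDL above):

-- Measurements for male users
SELECT g.*
FROM geodata AS g
JOIN "user" AS u USING (user_id)
WHERE u.gender = 'male';

-- Sunday measurements for users under 30
SELECT g.*
FROM geodata AS g
JOIN "user" AS u USING (user_id)
WHERE u.age < 30
  AND EXTRACT(DOW FROM g.time) = 0;  -- DOW: 0 = Sunday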

What alternative designs would you recommend?

Michael
  • @a_horse_with_no_name [tag:relational-database] is certainly relevant since "6NF" has no meaning outside that. – philipxy May 15 '19 at 03:26

1 Answer


Time series & temporal data do not per se use 6NF. (Pace that link.) What is needed is a CK & associated data that you want to record an atomic change to. 6NF is just frequently wanted but it is not per se the goal. The non-CK data could be multiple columns--you want to record changes to a location, not to a coordinate. (Similarly when you want to know whether an integer changed no one's disturbed that you don't have a table for every CK & digit.) You can think of that as a transformation of a 6NF table with CK & one tuple- or record-valued column.
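
As a concrete sketch of that last transformation (illustrative only, using PostgreSQL's built-in point type as the record-valued column):

CREATE TABLE geodata(
    user_id INT NOT NULL,
    time TIMESTAMPTZ NOT NULL,
    loc POINT NOT NULL,  -- built-in (x, y) pair of float8
    PRIMARY KEY (user_id, time)
);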

So here a design with a CK & X & Y is fine--as long as you don't need to know when a particular coordinate value changed.

"I'm new to Postgres and DBs in general." Then forget about "performance" until you learn enough to know what it means. Make straightforward designs. Next learn about constraints & indexes.

Re temporal data (including 6NF) everyone should read Date, Darwen & Lorentzos. Avoid Snodgrass.

PS PKs are irrelevant to relational model theory; CKs matter, & a PK is just some CK you called the PK. PS Beware that an SQL PK is more or less a superkey, not a CK; it could contain a smaller UNIQUE/superkey.

PS 6NF means a table satisfies no non-trivial JDs (join dependencies). It implies "Primary Key, and at most one other attribute", but the latter is not a definition of 6NF. Also note that that condition doesn't itself imply exactly one CK; there could be more.

PS Wikipedia is not a sound source for relational model information. Eg there's no one "1NF", & the various "1NF"s are orthogonal to normalization to the NFs leading to 6NF. Eg PKs don't matter. Eg normalization to higher NFs is not done by moving through lower NFs. (Moreover, doing so can preclude good destination-NF designs.) Eg DKNF does not belong on that page among the NFs leading to 6NF. Eg its definition of 6NF is wrong.

philipxy
  • Thank you very much for your input. I agree that first making it work and then focusing on performance is the right approach. But in my case performance will be extremely important for the team doing the querying and analyses. Right now we have hundreds of thousands of measurement files lying in a folder, and we just load the files directly in batches for analysis. The team won't use the DB if it is substantially slower. – Michael May 15 '19 at 11:04
  • So if I understand correctly, I should be using CK & (X,Y) instead of CK & X & Y (i.e. array types for X,Y instead of separate columns), right? Each location measurement is an atom `(x,y)`, rather than `x` and `y` being independent things. I will most certainly never be querying just `x` or just `y` columns. – Michael May 15 '19 at 11:07
  • I said use CK columns & columns X & Y. I thought that was clear from "a CK & X & Y is fine--as long as you don't need to know when a particular coordinate value changed". Re an array I explained there is no *reason* & [it is generally ill-advised](https://stackoverflow.com/q/3653462/3404097). Re performance see my next comment. PS Your decomposition to 2 tables is *obviously* desired independent of your application. Time to read a textbook on information modeling, the relational model & DB design. Then on optimizing for your DBMS. – philipxy May 15 '19 at 16:32
  • My current generic comment re "better"/"best" etc: There's no such thing as "better"/"best" in engineering unless you define it. Also unfortunately all reasonable practical definitions require a ridiculous amount of experience with a ridiculous number of factors that interact with chaotic sensitivity to details. Make straightforward designs. When you demonstrate via measurement that a design and all alternatives you can think of have problems (whatever that means at the time), then ask a very specific question. Which should also define "better"/"best". https://meta.stackexchange.com/q/204461 – philipxy May 15 '19 at 16:32
  • "a CK & X & Y is fine--as long as you don't need to know when a particular coordinate value changed" -- this is the part I don't understand; shouldn't it be the opposite? If I have separate columns for `x` and `y`, am I not telling the database that these are independent attributes? – Michael May 15 '19 at 18:38
  • I guess my point is: even if a person is moving along just one axis (say only in `y`), each location should be considered atomic because different locations are *physically different in space*. A person doesn't just change "one of their coordinates"; they are actually moving in space. In fact, if we change the coordinate system, both coordinates might change for the same physical movement. – Michael May 15 '19 at 19:13
  • I looked into the link provided and I'm not sure it applies here. I can see why variable-length lists of varying types can be a performance problem, but in my case I have a fixed-length (2) array of floats. I would imagine that DBs store tables row by row, but I read somewhere that a DB might decide to store columns separately (as indexes?) for optimization, i.e. `x`s and `y`s in separate arrays. This is not desired, as I will *always* be querying tuples `x,y` (it wouldn't make sense to know just one coordinate of a location). Hence the question about using an array type to atomize them. – Michael May 15 '19 at 19:17