
I have a time-series location data table containing the following columns (time, first_name, last_name, loc_lat, loc_long) with the first three columns as the primary key. The table has more than 1M rows.

I notice that first_name and last_name repeat quite often: there are only 100 distinct combinations across the 1M rows. To save disk space, I am therefore thinking about creating a separate people table with columns (id, first_name, last_name) and a unique constraint on (first_name, last_name), and simplifying the time-series location table to (time, person_id, loc_lat, loc_long), where person_id is a foreign key referencing the people table.

I want to first create a new table from my existing 1M-row table to test whether this change indeed yields meaningful disk space savings. The task feels quite doable, but I haven't found a concrete way to do it yet. Any suggestions?

  • Replacing subrows by other subrows (that supposedly identify them better than they identify themselves) is not DB normalization. – philipxy Dec 11 '22 at 05:39

1 Answer


That's a basic step of database normalization.

If you can afford to do so, it will be faster to write a new table that swaps full names for IDs than to alter the schema of the existing table and update all rows. Basically:

BEGIN;  -- wrap in single transaction (optional, but safer)

CREATE TABLE people (
  people_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, first_name text NOT NULL
, last_name text NOT NULL
, CONSTRAINT full_name_uni UNIQUE (first_name, last_name)
);

INSERT INTO people (first_name, last_name)
SELECT DISTINCT first_name, last_name
FROM   tbl
ORDER  BY 1, 2;  -- optional

ALTER TABLE tbl RENAME TO tbl_old;  -- free up the original table name

CREATE TABLE tbl AS
SELECT t.time, p.people_id, t.loc_lat, t.loc_long
FROM   tbl_old t
JOIN   people p USING (first_name, last_name);
-- ORDER BY ??  (optional: sort here to physically cluster rows in the new table)

ALTER TABLE tbl ADD CONSTRAINT people_id_fk FOREIGN KEY (people_id) REFERENCES people(people_id);

-- make sure the new table is complete. indexes? constraints?
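-- For example, a composite primary key mirroring the original (time, first_name, last_name)
-- key is one option; the exact key is an assumption here, adjust to your schema:
ALTER TABLE tbl ADD PRIMARY KEY (time, people_id);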
-- Finally:
DROP TABLE tbl_old;

COMMIT;
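To verify the disk space saving that motivated the change, compare the on-disk footprint of the old table with that of the new pair of tables while both still exist, i.e. before the final DROP TABLE. A minimal sketch using Postgres size functions, with the table names from the script above:

SELECT pg_size_pretty(pg_total_relation_size('tbl_old'))  AS old_size
     , pg_size_pretty(pg_total_relation_size('tbl')
                    + pg_total_relation_size('people'))    AS new_size;

Note that pg_total_relation_size() includes indexes and TOAST data, so add comparable indexes to the new table first to get a fair comparison.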

Related:

DISTINCT is simple. But for only 100 distinct full names - and with the right index support! - there are more sophisticated, (much) faster ways. See:
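One such approach is an emulated loose index scan with a recursive CTE. A minimal sketch, assuming a matching index on (first_name, last_name) exists on the big source table (it is not created by the script above):

WITH RECURSIVE name_list AS (
   (  -- parentheses required around the anchor query
   SELECT first_name, last_name
   FROM   tbl
   ORDER  BY first_name, last_name
   LIMIT  1
   )
   UNION ALL
   SELECT n.first_name, n.last_name
   FROM   name_list l
   CROSS  JOIN LATERAL (
      SELECT t.first_name, t.last_name
      FROM   tbl t
      WHERE  (t.first_name, t.last_name) > (l.first_name, l.last_name)
      ORDER  BY t.first_name, t.last_name
      LIMIT  1
      ) n
   )
SELECT * FROM name_list;

Each iteration fetches the next distinct name pair with a cheap index probe instead of reading all 1M rows, so this SELECT could replace the plain SELECT DISTINCT feeding the INSERT above.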

Erwin Brandstetter
  • Replacing subrows by other subrows (that supposedly identify them better than they identify themselves) is not DB normalization. Also that Wikipedia article wrongly associates DKNF with the sequence of NFs from relations & a 1NF up through 5NF & 6NF. – philipxy Dec 11 '22 at 05:40