Problem
I've been asked to copy a fair bit of data to Postgres in a new table. The data contains assembly component lists, simplified in the table definition below:
CREATE TABLE IF NOT EXISTS assembly_item (
id uuid NOT NULL DEFAULT NULL,
assembly_id. uuid. NOT NULL DEFAULT NULL,
done_dts timestamp NOT NULL DEFAULT 'epoch',
CONSTRAINT assembly_item_pk
PRIMARY KEY (id)
);
There are dozens of attributes in the original, and a few hundred million rows at the moment. These records are spread across several installations, and are not stored in Postgres locally. Insertions on this table add up quickly, and it's going to grow to 1B rows within the year, is the guess. Date is rarely updated, and never deleted. (It could happen in time, but not often.) The same id
is never duplicated with different assembly_id
values. So, unique at the partition level on id
is safe. The goal here is to offload this data onto Postgres, and leave only very recent data in a cache on the local servers.
This looks like a natural candidate for partitioning, and I'm looking for some guidance on a sensible strategy. You can see from the simplified structure that we've got a unique row id
, a parent assembly_id
, and a timestamp. I've looked at the existing queries in the original database, and the main search field is assembly_id
, the parent record identifier. The cardinality between assembly
and assembly_item
is around 1:200.
In order to make partitioning most useful, it seems like the data needs to be split based on a value that enables the query planner to prune out partitions intelligently. I've thought of a few ideas, but don't have the 200M rows to test agains yet. In the meantime, what I'm considering is:
Partition by month using either
RANGE
, or byLIST
onYYYY-MM
of thedone_dts
. Rewrite all of the queries to scope by date range.Partition by
HASH
against the first two characters ofassembly_id::text
, giving me 256 partitions with fairly equal sizes. I think that this lets us search onassembly_id
and prune out many partitions that won't have matches, but it looks pretty weird when I set it up.
I appreciate that I'm asking a somewhat speculative question, all I'm hoping for here are some pointers that might make my first attempt more successful. Once I've got a bit data set, I can experiment more directly.
I've included experimental setup code, with only a sampling of the partitions listed for brevity.
Sample setup using a LIST
partition
------------------------------------
-- Define table partitioned by list
------------------------------------
-- Could alternatively use RANGE here to partition by month.
BEGIN;
-- Drop parent table, if they exists.
-- This destroys ALL partitions automatically, even without a CASCADE clause.
DROP TABLE IF EXISTS assembly_item_list CASCADE;
CREATE TABLE IF NOT EXISTS assembly_item_list (
id uuid NOT NULL DEFAULT NULL,
assembly_id uuid NOT NULL DEFAULT NULL,
assembly_done_dts timestamp NOT NULL DEFAULT 'epoch', -- Copied in from assembly.done_dts when rows are pushed to Postgres.
year_and_month citext NOT NULL DEFAULT NULL, -- YYYY-MM from assembly_done_dts, calculated in insert function. Can't use a generated column as a partition key.
-- Reminder: id values come from the various source tables in IB. The upsert writes over matches ON CONFLICT with this ID.
-- Note: You *must* include the partition key in the primary key. It's a rule.
CONSTRAINT assembly_item_list_pk
PRIMARY KEY (year_and_month, id)
) PARTITION BY LIST (year_and_month);
-- Previous year partitions built here...
-- Build out 2021 completely.
CREATE TABLE assembly_item_list_2021_01 partition of assembly_item_list HASH (assembly_id) ('2021-01');
CREATE TABLE assembly_item_list_2021_02 partition of assembly_item_list HASH (assembly_id) ('2021-02');
-- etc.
-- In case I screw up at the end of the year....
CREATE TABLE assembly_item_list_default partition of assembly_item_list default;
COMMIT;
Sample setup using a HASH
partition.
------------------------------------
-- Define table partitioned by hash
------------------------------------
BEGIN;
-- Drop parent table, if they exists.
-- This destroys ALL partitions automatically, even without a CASCADE clause.
DROP TABLE IF EXISTS assembly_item_hash CASCADE;
CREATE TABLE IF NOT EXISTS assembly_item_hash (
id uuid NOT NULL DEFAULT NULL,
assembly_id uuid NOT NULL DEFAULT NULL,
assembly_done_dts timestamp NOT NULL DEFAULT 'epoch', -- Copied in from assembly.done_dts when rows are pushed to Postgres.
partition_key text NOT NULL DEFAULT NULL, -- '00', '0A', etc. Populated in a BEFORE INSERT trigger on the partition. Can't use a generated column as a partition key, can't use a column reference in DEFAULT.
-- Reminder: id values come from the various source tables in IB. The upsert writes over matches ON CONFLICT with this ID.
-- Note: You *must* include the partition key in the primary key. It's a rule.
CONSTRAINT assembly_item_hash_pk
PRIMARY KEY (partition_key, id)
) PARTITION BY HASH (partition_key);
-----------------------------------------------------
-- Create trigger function to populate partition_key
-----------------------------------------------------
-- The partition key is a two-character hex string, like '00', '3E', and so on.
CREATE OR REPLACE FUNCTION set_partition_key()
RETURNS TRIGGER AS $$
BEGIN
NEW.partition_key = UPPER(LEFT(NEW.assembly_id, 2));
RETURN NEW;
END;
$$ language plpgsql IMMUTABLE; -- I don't think that I need to worry about IMMUTABLE here. 01234567890ABCDEF shouldn't break.
-----------------------------------------------------
-- Build partitions
-----------------------------------------------------
-- Note: Have to assign triggers to partitions individually.
-- Seems that it would be easier to add the logic to my central insert function.
CREATE TABLE assembly_item_hash_00 partition of assembly_item_hash FOR VALUES WITH (modulus 256, remainder 0);
CREATE TRIGGER set_partition_key_trigger_00
BEFORE INSERT OR UPDATE ON assembly_item_hash_00
FOR EACH ROW
EXECUTE PROCEDURE set_partition_key();
CREATE TABLE assembly_item_hash_01 partition of assembly_item_hash FOR VALUES WITH (modulus 256, remainder 1);
CREATE TRIGGER set_partition_key_trigger_01
BEFORE INSERT OR UPDATE ON assembly_item_hash_01
FOR EACH ROW
EXECUTE PROCEDURE set_partition_key();
-- And so on for all 256 partitions.
COMMIT;
Any advice? Really, anything that comes to mind?