Postgresql Prune replicated data from outbox table

Question

Problem Statement

In order to ensure disk size isn't growing unnecessary, I want to be able to delete rows that have been replicated from my outbox table.

Context

Postgres is at v12

We are using a Kafka source connector to stream changes made to a postgres table. These changes are insert only and thus are no longer needed once written to kafka. The source connector is using logical replication to stream the changes to the connector and the state of the replication can be displayed in pg_replication_slots.

When looking at the pg_replication_slots you can see useful data that it's storing in order to know what logs it has to keep to ensure replication can still happen for the client.

For example when I run:

select * from pg_replication_slots;

I might see:

 slot_name |  plugin  | slot_type | datoid |      database      | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+--------------------+-----------+--------+------------+------+--------------+-------------+---------------------
 debezium  | wal2json | logical   |  26593 | database_name | f         | t      |       7404 |      |        26729 | 0/DCD98E8   | 0/DCD9920
(1 row)

What I'm interested in knowing is if I can reliably use that data and then the postgresql metadata on the table to select all rows that have been replicated from that slot.

For example, this doesn't work as far as I can tell, but ideally would return rows that have been replicated and are now safe to prune from the table:

select * from outbox where age(xmin) < (select age(catalog_xmin) from pg_replication_slots);

Any guidance would be sweet! Cheers!

Another thought I had was to set up a kafka consumer explicitly for knowing deleting that row if I know it's been replicated across all partitions, but this seems like a less than ideal solution. — MattyIceDev, Mar 22 '22 at 03:46

score 1 · Answer 1 · answered Apr 01 '22 at 19:21

1

I have been implementing the Outbox pattern using Debezium with MySQL and delete the outbox record straight after inserting it which I saw done here https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/ The insert is picked up and sent and the delete is ignored. So essentially there should never be anything in the outbox table(outside of the transaction). I also pre-generate the primary keys for the entries(which I use for the event ID in Kafka) so I can bulk insert and delete.

answered Apr 01 '22 at 19:21

Bennett T

11
1

Yeah, I definitely looked at this as well. I will say part of the reason we are implementing this pattern is to get more resiliency into our messages, today our messages are held in memory as we batch send them to kafka. This has caused issues when kafka is unavailable or the pod crashes. Immediately deleting the messages then takes away our ability to have ad hoc snapshot capabilities. – MattyIceDev May 02 '22 at 18:05

score 0 · Answer 2 · answered May 02 '22 at 18:02

Circling back around to this, I had to think a bit differently around how I we could tie the replications progress to our outbox table. Previously in my question I was trying to glean progress from pg_replication_slots, but in this working example I switched to using pg_stat_replication. This table can be queried by the slot_name we care about and can return lag results. For an example:

SELECT * FROM outbox WHERE created_at < (SELECT(NOW() - COALESCE(replay_lag, interval '60 seconds')) as stale_time from pg_stat_replication where pg_stat_replication.slot_name = 'outbox_slot');

So here this will return to us rows from our outbox table that were inserted outside of our replay_lag time or 1 minute.

Postgresql Prune replicated data from outbox table

Problem Statement

Context

2 Answers2