
PROBLEM: While setting up a CDC pipeline using Datastream on Google Cloud Platform, any DELETE query fired on the source table is replicated to the destination table as well, which we need to prevent.

SOLUTION NEEDED: How do we prevent deletion of data that has already been written to the destination table, even when a delete happens on the source table, while using the Datastream service on Google Cloud Platform?

p.s. Open to all types of solutions.

Note: We are also considering writing a trigger on the BigQuery tables that restores deleted data, but only as a last resort, if we cannot control this behaviour through Datastream and have exhausted all other options.

DEBUGGING DONE: We tried to prevent deletions in the destination table by introducing a filter into Datastream that only streams rows whose Datastream metadata field "is_deleted__" is false, so that when data is deleted in the source those rows are marked true and Datastream does not pick them up to update the destination table. But there is a problem.

ISSUE WITH THE DEBUGGING: The is_deleted field is only visible for tables that do not have a primary key; it is not available for tables with a primary key, so this workaround can only be applied to tables without a primary key.

  • Please provide enough code so others can better understand or reproduce the problem. – Community Feb 22 '23 at 20:17
  • This is a completely UI-based setup on Google Cloud Platform, and there is no code involved at this point in time. If that changes as development progresses, I will add it here! – Rahul Ranjit Feb 23 '23 at 06:56
  • I see Datastream in combination with BigQuery currently working more as a replication solution than as CDC. We have the same problem. You will probably get the format you want if you dump your data into GCS first, but that adds a fair amount of complexity to your design. Have you found any other way to support this in the meantime? Also posting a link to the GCS workaround. https://medium.com/@meraghudy/handling-soft-deletes-and-partitioning-with-google-cloud-datastream-change-data-capture-fa928cd69e1e – martez May 22 '23 at 14:12
  • What is your source database? – vilozio Aug 08 '23 at 15:24

1 Answer


Depending on your source database, you can do the following.

PostgreSQL. You can alter the publication used by Datastream so that it excludes DELETE and TRUNCATE operations: ALTER PUBLICATION publication_name SET (publish = 'insert, update');.
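For example, a minimal sketch, assuming the publication is named datastream_publication (substitute the name Datastream was configured with; pg_publication lists the candidates):

-- Inspect existing publications and which operations they publish
SELECT pubname, pubinsert, pubupdate, pubdelete, pubtruncate
FROM pg_publication;

-- Restrict the publication to INSERT and UPDATE only, so deletes
-- and truncates are never sent to Datastream
ALTER PUBLICATION datastream_publication SET (publish = 'insert, update');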

MySQL. As suggested here, you can run SET SQL_LOG_BIN=0; for the relevant session on the master before executing your delete. This disables binary logging for that session only, so the DELETE never reaches the binlog that Datastream reads.
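A minimal sketch of such a session (the table and filter are hypothetical; setting SQL_LOG_BIN requires the SUPER privilege, or SYSTEM_VARIABLES_ADMIN on MySQL 8):

-- Stop writing this session's statements to the binary log
SET SQL_LOG_BIN = 0;

-- This delete happens on the source but is invisible to Datastream
DELETE FROM orders WHERE order_id = 42;

-- Restore binary logging for the rest of the session
SET SQL_LOG_BIN = 1;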

Oracle. I don't have expertise with Oracle databases, so I recommend consulting Oracle's documentation or support. Datastream relies on Oracle LogMiner, and I didn't find a way to exclude DELETE operations there.

I might suggest writing a trigger to prevent the delete operation, as described in this answer (this works for a single table). I adjusted the example so it sets the row's is_deleted column to 1 instead of blocking the delete. Note that Oracle only allows INSTEAD OF triggers on views, so the trigger must be created on a view over the table:

-- Oracle permits INSTEAD OF triggers only on views, so <VIEWNAME> is
-- a view defined over <TABLENAME> (see the sketch below)
CREATE OR REPLACE TRIGGER prevent_delete_on_table
INSTEAD OF DELETE ON <VIEWNAME>
FOR EACH ROW
BEGIN
  -- Soft delete: flag the row instead of removing it
  UPDATE <TABLENAME>
  SET is_deleted = 1
  WHERE id = :OLD.id;
END;
/
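A minimal sketch of the supporting view, with placeholder names matching the trigger above; deletes must then be issued against the view for the trigger to fire:

-- Expose the table through a view so the INSTEAD OF trigger can intercept deletes
CREATE OR REPLACE VIEW <VIEWNAME> AS
SELECT * FROM <TABLENAME>;

-- Fires prevent_delete_on_table and flags the row instead of deleting it
DELETE FROM <VIEWNAME> WHERE id = 42;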

But again, before implementing triggers I recommend consulting with an Oracle database professional.

vilozio