For Snowflake prod->dev scheduled cloning, what is a good way to handle custom rules?

Question

We are trying to setup dev and qa environments using data from a prod environment.

We are not using CREATE DATABASE dev CLONE prod because we are trying to avoid cloning database-specific objects like stages and pipes, since we are using per-environment Terraform to manage pipe-related objects and want to avoid out-of-band changes to those objects.

On top of that, there are some tables that should not be cloned from prod->dev. I'm trying to design a cleaner solution than the cobbled mess that we have.

We have a scheduled script that does the following:

Connect to prod and dev databases and fetch the right src and dst schemas
Run SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<>' AND TABLE_TYPE = 'BASE TABLE' to get tables to clone
- Cloning tables across databases results in dangling references to constraints and sequences, so those have to be manually cloned https://docs.snowflake.com/en/user-guide/database-replication-considerations.html?#references-to-objects-in-another-database
For each table:
- If it shouldn't be cloned, skip it
- Run CREATE OR REPLACE TABLE <dev> CLONE <prod> COPY GRANTS;
- Run GET_DDL(<dev>) to see if the table has sequences/constraints to update
- Run CREATE OR REPLACE SEQUENCE <dev> CLONE <prod> to update the nextval of the sequence since our table was cloned and references the sequence from the source database (and it also has the wrong value anyways)
- Run ALTER TABLE <dev> ALTER COLUMN <> SET DEFAULT <new seq>.nextval
- Check if there are constraints
  - Run ALTER TABLE <dev> DROP CONSTRAINT <> since the cloned tables reference the source database
  - Run ALTER TABLE <dev> ADD CONSTRAINT <> to rebuild them to reference the destination database

So... it works, but it's hacky, fragile, and prone to updating because of custom rules. We currently have this running on an AWS lambda, but a first step would be to migrate this to pure Snowflake.

Does anyone have any suggestions to improve this process? Or at least have recommendations on Snowflake tools that

score 1 · Answer 1 · answered Jun 22 '22 at 11:03

I realise this is not really an answer to your question but I would absolutely not do what you are proposing to do - it's not the way to manage your SDLC (in my opinion) and, especially if your data contains any PII information, copying data from a Prod to a non-Prod database runs the risk of all sorts of regulatory and audit issues.

I would do the following:

As a one-off exercise, create the scripts necessary to build the objects for your "standard" environment - presumably basing this off your current Prod environment
Manage these scripts in a version-controlled repository e.g. Git

You can then use these scripts to build any environment and you would change them by going through the standard Dev, Test, Prod SDLC.

As far as populating these environments with data goes, if you really need Production-like data (and production volumes of data) then you should build routines for copying the data from Prod to the chosen target environment that, where necessary, anonymise the data. These scripts should be managed in your code repository and as part of your SDLC there should be a requirement to build/update the script for any new/changed table

Re: #1) We had migrated from Postgres to Snowflake and in the process lost the base definitions for our initial table definitions (oops), and we currently use Flyway to manage future schema changes - but I agree, I do realize the need to be able to init our database from scratch. // // Re: Prod-like anonymous data - we're currently a mix of doing consultancy work with customers (MLOps) so it's a bit tougher to anonymise the data when churning through models in dev (and it's partially the fault of our pipeline), though again I do know why this should happen. — aphrid, Jun 22 '22 at 14:29
Thinking about it more, I do agree that the process should be part of our SDLC - maybe the SQLizing of the process could fit into a Flyway migration script too, though that might just moving the problem without improving it. We're lucky (?) that all of our environments live in the same AZ in the same account without any cross-region data transfers, but that is prone to errors from our side. Thanks for the insight! I really do appreciate it — aphrid, Jun 22 '22 at 14:32

For Snowflake prod->dev scheduled cloning, what is a good way to handle custom rules?

1 Answers1