This seems like a pretty common use-case. Let's say we have sensitive PII that we want to protect, such as SSNs. We mask that data using dynamic data masking in Snowflake. Now we have an engineer that is writing data transformations, and they need to join two tables using SSN. They don't have clearance to view the SSNs, but they can view the other information on both tables. I want the engineer to be able to join the two tables, and see all the combined unsecured data, while keeping the SSN secret from the engineer. I'm really not sure why Snowflake doesn't use real values for joins behind the scenes while refusing to return them in results. Is there a workaround?
One idea is to make the masking policy return a hash of the initial value. That has a couple of limitations. First, it is explicitly warned against in the Snowflake docs. Second, it requires runtime hashing of all the values, which slows down query execution seemingly needlessly. Third, there is the issue of hash collisions which could break joins. This could result in an engineer spending days working to track down a bug in their code, only to realize that the extra rows in their dataset are the result of a hash collision.
Another potential solution is using an external tokenization provider (docs). I don't understand this option well, but it appears that this would mean that I would need to store the actual values and their tokenized form with a third party service, then make an API call each time I wanted to use the values in a query. That seems less than ideal. I'd rather the solution be contained within Snowflake.
I'd love to hear any thoughts, thanks in advance.