1

This seems like a pretty common use-case. Let's say we have sensitive PII that we want to protect, such as SSNs. We mask that data using dynamic data masking in Snowflake. Now we have an engineer that is writing data transformations, and they need to join two tables using SSN. They don't have clearance to view the SSNs, but they can view the other information on both tables. I want the engineer to be able to join the two tables, and see all the combined unsecured data, while keeping the SSN secret from the engineer. I'm really not sure why Snowflake doesn't use real values for joins behind the scenes while refusing to return them in results. Is there a workaround?

One idea is to make the masking policy return a hash of the initial value. That has a couple of limitations. First, it is explicitly warned against in the Snowflake docs. Second, it requires runtime hashing of all the values, which slows down query execution seemingly needlessly. Third, there is the issue of hash collisions which could break joins. This could result in an engineer spending days working to track down a bug in their code, only to realize that the extra rows in their dataset are the result of a hash collision.

Another potential solution is using an external tokenization provider (docs). I don't understand this option well, but it appears that this would mean that I would need to store the actual values and their tokenized form with a third party service, then make an API call each time I wanted to use the values in a query. That seems less than ideal. I'd rather the solution be contained within Snowflake.

I'd love to hear any thoughts, thanks in advance.

benjaminjsanders
  • 827
  • 8
  • 13
  • 1
    A better solution in a pipeline might be never using SSNs to join. Just create a new table mapping SSNs to unique ids particular to your database. Then populate your tables using your unique ids. Consider that SSNs are not a reliable primary key to begin with https://stackoverflow.com/a/45437673/132438 – Felipe Hoffa Jun 04 '21 at 04:35
  • `I'm really not sure why Snowflake doesn't use real values for joins behind the scenes while refusing to return them in results. Is there a workaround?` This would be insecure. It's a simple matter to create a table containing every possible SSN. Now think of a two column table, with identical SSN values per row. By applying the masking policy to one column and joining it to a table with real SSNs that are masked, the engineer who's not supposed to see SSNs has access to all of them. – Greg Pavlik Jun 04 '21 at 17:49
  • @GregPavlik : Well, I suppose that explains that! Thanks! – benjaminjsanders Jun 04 '21 at 17:55

1 Answers1

1

If you care about database integrity and avoid errors: Don't use SSNs as identifiers.

A SSN can be a property of a person, but don't use it as their primary key.

As the United States Social Security Administration says:

A 1990 OIG, HHS study indicated that 45% of organizations, both public and private, using SSNs make no effort to verify SSN accuracy. This leads to the real possibility that transfers of data from one organization to another could be inaccurate; computer matching of data between different organizations could be invalid; and innocent persons could be subjected to unwarranted intrusions into their privacy or improper changes in their benefits or services or even misidentified with serious results.

Also:

The SSN is the single most widely used record identifier for both government and the private sector, exerting a broad influence on the lives of most Americans. However, by itself, it is not a personal identifier because it lacks systematic assignment to every person and the means to authenticate a person's identity.

Instead you could create a unique id for each person within your database, and use that key for joins.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325