
I've read a lot about Snowflake caching, but I don't understand how Snowflake gets around the data-freshness problem encountered by other "shared everything" architectures. My understanding is that, with shared everything, when the data changes there essentially needs to be some locking.

As explained in this article:

For efficiency both nodes have cached local copies of record 1 in memory. A client then tries to update record 1 so that ‘foo’ becomes ‘bar’. To do this in a consistent manner the DBMS must take a distributed lock on all nodes that may have cached record 1. Such distributed locks become slower and slower as you increase the number of machines in the cluster and as a result can impede the scalability of the writing process.

In other words, if Snowflake is caching data in the compute layer but the underlying data changes in S3, doesn't Snowflake run into the same caching issues as other shared-everything architectures? I get that in an analytics database there are fewer updates, but then why wouldn't traditional shared-everything architectures work?

The only answer I've seen is that Snowflake has a query optimizer that checks whether the underlying data has changed, but I still don't understand how that is a significant upgrade over the caching strategies of traditional shared-everything DBs.

user924088
  • It’s not clear from your question what you think Snowflake is claiming, regarding caching, that doesn’t apply to some other databases - can you please explain? Snowflake will use the data it holds in various types of cache unless the cached data is stale, in which case it will query the tables (in a similar way to many other DBMSs) - what’s your question/issue with this? – NickW Nov 03 '21 at 20:47
  • Like Nick, I'm not sure what your question is getting at, but I'd reply to the last portion of it. Snowflake's statistics on data are always up to date; you don't need to update them. Snowflake therefore knows when the "remote" storage data has changed, so it can invalidate what's in cache and fetch the data from remote storage instead. – patrick_at_snowflake Nov 03 '21 at 22:55
  • Hey gang - I'm a data engineering teacher, so I have no issue and am not getting at anything. When I explain technologies, I try to teach by explaining how they are upgrades over previous technologies. For Snowflake, that seems to be that the compute layer is separated from storage, but I was having a hard time understanding how it's an improvement over a shared-everything architecture. So I was planning on explaining that the evolution was: 1. shared everything (issues with caching/locking) 2. shared nothing (issues with shuffling/scaling) 3. Snowflake - somehow the best of both, but how? – user924088 Nov 04 '21 at 00:15
  • The benefit of splitting compute from storage is that you can scale your compute resources up/out without impacting your data. So Snowflake allows you to run multiple warehouses (compute engines) of various sizes, all reading the same data without impacting each other. – NickW Nov 04 '21 at 09:23

1 Answer


I believe the key concept that you are missing here is that Snowflake never updates a record in place...ever. It instead creates a new micro-partition containing all the data from the old micro-partition with the changes applied. Because of this, the services layer of Snowflake immediately knows when a micro-partition held in a warehouse cache is no longer valid: that partition is simply no longer active according to the services layer.

The only time you'd need a lock is if you were "updating" the cache or the underlying data in place, and that never happens here. The new micro-partitions become active as soon as the update statement completes. While the update is running there are no "dirty" states of the data, so any query against that table sees either the previous state (while the update is still running) or the post-update state. No locks needed.
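Here is a rough Python sketch of that idea (the class and method names are made up purely for illustration and are not Snowflake's actual internals): micro-partitions are immutable, the services layer tracks which partition IDs are currently active for a table, an update writes brand-new partitions and atomically swaps the active set, and a warehouse cache is only trusted for partitions that are still active.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# Immutable micro-partition: once written, its contents never change.
@dataclass(frozen=True)
class MicroPartition:
    partition_id: int
    rows: Tuple[Tuple, ...]

class ServicesLayer:
    """Tracks which micro-partitions are currently 'active' for each table."""
    def __init__(self):
        self._next_id = 0
        self.partitions: Dict[int, MicroPartition] = {}
        self.active: Dict[str, FrozenSet[int]] = {}

    def write_partition(self, rows) -> int:
        # Writing never touches existing partitions; it only creates new ones.
        pid = self._next_id
        self._next_id += 1
        self.partitions[pid] = MicroPartition(pid, tuple(rows))
        return pid

    def commit(self, table: str, partition_ids) -> None:
        # The "update" is just an atomic swap of the active partition set;
        # no existing partition is ever modified, so no locks are needed.
        self.active[table] = frozenset(partition_ids)

class WarehouseCache:
    """A compute node's local cache, validated against the services layer."""
    def __init__(self, services: ServicesLayer):
        self.services = services
        self.cached: Dict[int, MicroPartition] = {}

    def scan(self, table: str) -> List[Tuple]:
        rows = []
        for pid in self.services.active[table]:   # always the current active set
            if pid not in self.cached:            # missing -> fetch from remote storage
                self.cached[pid] = self.services.partitions[pid]
            rows.extend(self.cached[pid].rows)
        return rows

# Usage: an UPDATE writes a brand-new partition and swaps it in.
svc = ServicesLayer()
p0 = svc.write_partition([(1, "foo")])
svc.commit("t", [p0])

wh = WarehouseCache(svc)
print(wh.scan("t"))                        # [(1, 'foo')] -- p0 now cached locally

p1 = svc.write_partition([(1, "bar")])     # update creates a NEW partition
svc.commit("t", [p1])                      # atomic swap; p0 is simply no longer active

print(wh.scan("t"))                        # [(1, 'bar')] -- cached p0 ignored, p1 fetched
```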

Does that answer your question?

Mike Walton