I have a table that consists of 56 million rows.

This table handles a high load of UPSERTs every 5 minutes, as it is loading streaming data from Kafka. Approximately 200-500k updates per load.

When I run a SELECT with an ORDER BY against one of the timestamp columns, it takes 5-7 minutes to return a result.

I tried a clustering key on that column, but because of the high rate of DML on the table and the high cardinality of the column itself, clustering was ineffective and costly.

So far, the only thing that has significantly reduced the query time (to about 15 seconds) is increasing the warehouse size from a Small to an X-Large.

I am not convinced that the only solution is to increase the warehouse size. Any advice here would be great!

donkey
  • ORDER BY operations typically require larger warehouses so they have enough space to do the sorting. If you are not filtering by the timestamp field, then clustering won't help you in the sort operation. Look at the SQL profile to see whether the initial scan is costing you time vs. the sort. If it's the sort, are you seeing a lot of Spilling to Remote Storage? If so, you need a larger warehouse. – Mike Walton Jul 18 '21 at 14:14
  • Are you trying to "select order by" the 56 million rows? Did you add a limit to the query? – Felipe Hoffa Jul 18 '21 at 20:26
  • I was querying ```select * from table order by date desc``` with no limit; however, I thought DataGrip automatically limited the query to 500 rows. I checked the query plan, and the sort was the most expensive operation, with a lot of spilling to remote storage. There are also 40 columns in the table. Things that significantly reduced the query execution time were limiting the result set, reducing the columns in the select, and a larger warehouse. Thanks! – donkey Jul 19 '21 at 23:32

2 Answers

Clustering on date(timestamp) (or some other lower-cardinality expression) would be more effective, although given the volume of updates it will still be expensive.

At a happy hour event, I heard of a Snowflake user who achieved acceptable results in a similar(ish) scenario by clustering on a late-arriving-facts flag (e.g. iff(event_date < current_date, true, false)) (although I think they were INSERTing rather than UPSERTing, and in the latter case the micro-partitions have to be re-written anyway, so it might not help much.)
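A rough sketch of both clustering options (table and column names here are placeholders, since the question doesn't give the real ones):

```sql
-- Option 1: cluster on the day, which is much lower cardinality
-- than the raw timestamp.
ALTER TABLE my_table CLUSTER BY (DATE(event_ts));

-- Option 2: cluster on a boolean "late arriving" flag, so the vast
-- majority of rows (historical dates) land together.
ALTER TABLE my_table CLUSTER BY (IFF(DATE(event_ts) < CURRENT_DATE, TRUE, FALSE));
```

Either way, automatic clustering will keep re-sorting micro-partitions as the UPSERTs land, which is where the cost comes from.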

There are other things to consider too.

Inspect the query plan to confirm that ordering is the problem (i.e. is a lot of time spent on the sort?). Without seeing your actual query, I wonder whether the majority of the time is spent on the table scan (grabbing the data from remote storage). If a larger warehouse improves performance, this is likely the case, since every added node in the cluster means more micro-partitions can be read concurrently.
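Beyond eyeballing the Query Profile, you can check for spilling programmatically. A sketch against the ACCOUNT_USAGE view (the ILIKE filter is just an illustrative way to find the relevant queries; ACCOUNT_USAGE data is also delayed, typically by up to ~45 minutes):

```sql
-- Recent queries with their elapsed time and spill volumes.
-- Non-zero BYTES_SPILLED_TO_REMOTE_STORAGE suggests the warehouse
-- is too small for the sort.
SELECT query_id,
       total_elapsed_time / 1000 AS elapsed_seconds,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%order by%'
ORDER BY start_time DESC
LIMIT 20;
```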

Nat Taylor
  • I tried the above, but due to the number of UPSERTs clustering was ineffective. It seems limiting the query result and the columns in the select statement on a Small warehouse returns a result in about 4 seconds! With this in mind, if I need faster results I can increase the warehouse size, but this is already optimal! – donkey Jul 19 '21 at 23:34
Are you running against:

  1. A true timestamp column?
  2. A JSON column cast as a timestamp, with no additional function?
  3. How many fields are in the JSON?
  4. What is the relative ratio of UPDATEs to INSERTs?
  5. Have you looked at the cluster statistics?
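For point 5, the clustering statistics can be pulled with a system function; a sketch with placeholder table/column names:

```sql
-- Returns a JSON document with average clustering depth and the
-- partition-depth histogram for the given column(s).
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(event_ts)');
```

A high average depth and a histogram skewed toward deep partitions would indicate the clustering is not holding up under the UPSERT load.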
Jeffrey Jacobs
  • 1. Yes 2. No 3. N/A 4. There are more updates than inserts, but I'm not sure of the exact ratio. I'd say at least 75% are updates, the remainder being inserts. 5. Yes, the average cluster depth was quite high after each upsert, between 200-400, and the micro-partitions were mostly at the lower end of the histogram. As mentioned above, I found that limiting the query result set and the columns in the select dramatically reduced the execution time to around 4 seconds. – donkey Jul 19 '21 at 23:39
  • Just as an FYI, limiting the result set is generally the last step, after all of the rows have been processed. It only reduces the transmission time. Glad you solved it. – Jeffrey Jacobs Jul 21 '21 at 02:17