
I have a Redshift cluster with 3 nodes. Every now and then, with users running queries against it, we end up in an unpleasant situation where some queries run far longer than expected (even simple ones, exceeding 15 minutes), and the cluster's storage usage starts climbing, to the point that if you don't terminate the long-running queries it reaches 100% of the available storage.

I wonder why this may happen. My experience varies: sometimes it's been a single query doing this, and sometimes it's been several different queries being run at the same time.

mar tin
  • Are you running ETL at the same time? DMS? – Jon Scott Mar 08 '18 at 11:53
  • Yes, I have some ETL jobs that run continuously on it – mar tin Mar 08 '18 at 12:08
  • I have had a similar issue when there has been contention between ETL and query jobs – Jon Scott Mar 08 '18 at 17:10
  • It might be a lot of updates that are not being committed until the very end. Doing smaller, committed batches might reduce the temporary storage requirements. You can also [find size of database, schema, table in redshift](https://stackoverflow.com/q/21767780/174777) to find out where the space is going. – John Rotenstein Mar 08 '18 at 22:42
  • @JohnRotenstein You mean the ETL updates? In which case the presence of user queries would just be too much on top of those? – mar tin Mar 08 '18 at 22:44
  • Your jobs would certainly compete against each other. This can be minimised by the use of [Query Queues](https://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html) that can provide guaranteed resources for different classes of queries (eg ETL vs ad-hoc queries). However, your exhaustion of disk space is more worrying and should be investigated first by figuring out what activities are consuming the disk space. – John Rotenstein Mar 08 '18 at 22:52
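
A minimal sketch of the disk-space checks suggested in the comments above, using the standard Redshift system views STV_PARTITIONS and SVV_TABLE_INFO (the LIMIT is arbitrary):

```sql
-- Overall disk usage per node; used and capacity are in 1 MB blocks.
SELECT owner AS node,
       SUM(used)     AS used_blocks,
       SUM(capacity) AS total_blocks
FROM stv_partitions
GROUP BY owner
ORDER BY owner;

-- Which permanent tables are taking the space; size is in 1 MB blocks.
-- Intermediate spill from running queries will not show up here, only in
-- the per-node usage above while those queries are running.
SELECT "schema", "table", size AS size_mb, tbl_rows, pct_used
FROM svv_table_info
ORDER BY size DESC
LIMIT 20;
```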

2 Answers


One specific scenario where we saw this happen was related to LISTAGG. The result type of LISTAGG is varchar(65535), and while Redshift optimizes away the implicit trailing blanks when the value is stored to disk, the full width is required in memory during query processing.

If you have a query that returns a million rows, you end up with 1,000,000 rows times 65,535 bytes per LISTAGG value, which is roughly 65 gigabytes. That can quickly get you into a situation like the one you describe, with queries taking unexpectedly long or failing with “Disk Full” errors.
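
As a rough illustration (the table and column names here are made up for the example, not taken from the question), a query of this shape keeps one full-width LISTAGG value per output row while it executes:

```sql
-- Hypothetical example: one aggregated string per customer.
-- Each LISTAGG result is handled as varchar(65535) in memory during
-- execution, so ~1,000,000 output rows can need tens of GB of working
-- space even if the concatenated strings themselves are short.
SELECT customer_id,
       LISTAGG(order_id::varchar, ',') WITHIN GROUP (ORDER BY order_date) AS order_ids
FROM orders
GROUP BY customer_id;
```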

My team discussed this a bit more on our team blog the other day.

Ying

This typically happens when a poorly constructed query spills too much data to disk. For instance, a user accidentally specifies a Cartesian product (every row of tblA joined to every row of tblB).

If this happens regularly, you can implement a QMR (query monitoring rule) that limits the amount of disk spill allowed before a query is aborted.
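
As a sketch of the diagnostic side (the threshold below is an arbitrary placeholder), the metric that such a QMR rule evaluates, query_temp_blocks_to_disk, can also be queried after the fact to find the offending statements:

```sql
-- Recent queries that spilled heavily to disk or performed a nested-loop
-- (Cartesian) join; query_temp_blocks_to_disk is reported in 1 MB blocks.
SELECT query,
       service_class,
       query_temp_blocks_to_disk,
       nested_loop_join_row_count,
       query_execution_time
FROM svl_query_metrics_summary
WHERE query_temp_blocks_to_disk > 100000   -- placeholder: > ~100 GB of spill
   OR nested_loop_join_row_count > 0
ORDER BY query_temp_blocks_to_disk DESC
LIMIT 20;
```

The QMR rule itself is defined in the cluster's WLM configuration using the same metric, typically with an abort action.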

Joe Harris
  • Makes sense, but in my case queries that normally take very little time sometimes take much longer and end up filling the cluster. – mar tin Mar 14 '18 at 21:21