2

We are ingesting data into Snowflake via the kafka connector. To increase the data read performance / scan less partitions we decided to add a clustering key to a a key / combination of keys stored in the RECORD_CONTENT variant field.

The data in the RECORD_CONTENT field looks like this:

{
  "jsonSrc": {
    "Integerfield": 1,
    "SourceDateTime": "2020-06-30 05:33:08:345",
    *REST_OF_THE_KEY_VALUE_PAIRS*
}

Now, the issue is that clustering on a datetime col like SourceDateTime does NOT work:

CLUSTER BY (to_date(RECORD_CONTENT:jsonSrc:loadDts::datetime))

...while clustering on a field like Integerfield DOES work:

CLUSTER BY (RECORD_CONTENT:jsonSrc:Integerfield::int )

Not working means: when using a filter on RECORD_CONTENT:jsonSrc:loadDts::datetime, it has no effect on the partitions scanned, while filtering on RECORD_CONTENT:jsonSrc:Integerfield::int does perform partition pruning.

What is wrong here? Is this a bug?

Note that:

  • There is enough data to do meaningful clustering on RECORD_CONTENT:jsonSrc:loadDts::datetime
  • I validated clustering on RECORD_CONTENT:jsonSrc:loadDts::datetime working by making a copy of the raw table, with RECORD_CONTENT:jsonSrc:loadDts::datetime in a seperate column loadDtsCol and then adding a similar clustering key on that column: to_date(loadDtsCol).

1 Answers1

1

For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes: Dates and timestamps, especially non-ISO 8601dates and timestamps, as string values

Numbers within strings

Arrays

Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.

See this link: https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure

Mike Gohl
  • 627
  • 4
  • 7
  • Thanks mike. I guess I need to read this as 'for ANY pruning at all to work, when the data contains timestamps, it needs to be stored in a separate column' . Because whatever I tried with the datetime field, partition pruning does not work at all... – Rogier Werschkull Jul 02 '20 at 06:45