
Update 2022-08-31

While the accepted answer did confirm that the behavior is equivalent, there is a blind spot: the cache_result implementation does not overwrite the previously cached table, so extra storage is used up each time a result is re-cached. In the example below, you can see that a second temp table is created on the second call. Hopefully Snowflake can update this. I'm using version 0.8.0 at the moment.

>>> dff = session.create_dataframe([1,2,3])

>>> dff.cache_result().explain()

---------DATAFRAME EXECUTION PLAN----------
Query List:
1.
SELECT  *  FROM (SNOWPARK_TEMP_TABLE_X3FCJ1U38A)
...

--------------------------------------------

>>> dff.cache_result().explain()

---------DATAFRAME EXECUTION PLAN----------
Query List:
1.
SELECT  *  FROM (SNOWPARK_TEMP_TABLE_Z9H68STVDH)
....
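
Until then, one workaround is to cache into a fixed table name yourself, so each re-cache overwrites the previous copy instead of leaving another SNOWPARK_TEMP_TABLE_* behind. A minimal sketch using the same save_as_table call as Approach 1 in the original question below (the table name cached_df is just illustrative):

>>> dff = session.create_dataframe([1,2,3])
>>> # Overwrite mode reuses the same named temporary table on every re-cache
>>> dff.write.mode('overwrite').save_as_table('cached_df', table_type='temporary')
>>> dff_cached = session.table('cached_df')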


Original question

In Snowpark, does the cache_result() method of the Snowpark DataFrame class essentially do the same thing as writing to a temporary table? It certainly seems that way, and it would save a lot of keystrokes, but I wanted to be certain of it as well as understand any potential difference in resource credit usage.

In other words, are there any differences to these two approaches below?

sql = '''
select * from foo
'''

# Approach 1: saving to a temporary table and then reading that table back
session.sql(sql).write.mode('overwrite').save_as_table('my_tbl', table_type='temporary')
df = session.table('my_tbl')

# Approach 2: using cache_result
df = session.sql(sql).cache_result()

Maxim

1 Answer


You are right. Caching in Snowpark has a different meaning than in Spark: df.cache_result() writes a temporary table. As the doc states:

https://docs.snowflake.com/ko/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.html#snowflake.snowpark.DataFrame.cache_result

Returns: A DataFrame object that holds the cached result in a temporary table.

You can also verify this by calling df.cache_result().explain() or by looking at the Python API source code:

@df_collect_api_telemetry
def cache_result(
    self, *, statement_params: Optional[Dict[str, str]] = None
) -> "DataFrame":
    """Caches the content of this DataFrame to create a new cached DataFrame.

    All subsequent operations on the returned cached DataFrame are performed on the cached data
    and have no effect on the original DataFrame.

    Examples::
        >>> create_result = session.sql("create temp table RESULT (NUM int)").collect()
        >>> insert_result = session.sql("insert into RESULT values(1),(2)").collect()

        >>> df = session.table("RESULT")
        >>> df.collect()
        [Row(NUM=1), Row(NUM=2)]

        >>> # Run cache_result and then insert into the original table to see
        >>> # that the cached result is not affected
        >>> df1 = df.cache_result()
        >>> insert_again_result = session.sql("insert into RESULT values (3)").collect()
        >>> df1.collect()
        [Row(NUM=1), Row(NUM=2)]
        >>> df.collect()
        [Row(NUM=1), Row(NUM=2), Row(NUM=3)]

        >>> # You can run cache_result on a result that has already been cached
        >>> df2 = df1.cache_result()
        >>> df2.collect()
        [Row(NUM=1), Row(NUM=2)]

        >>> df3 = df.cache_result()
        >>> # Drop RESULT and see that the cached results still exist
        >>> drop_table_result = session.sql(f"drop table RESULT").collect()
        >>> df1.collect()
        [Row(NUM=1), Row(NUM=2)]
        >>> df2.collect()
        [Row(NUM=1), Row(NUM=2)]
        >>> df3.collect()
        [Row(NUM=1), Row(NUM=2), Row(NUM=3)]

    Args:
        statement_params: Dictionary of statement level parameters to be set while executing this action.

    Returns:
         A :class:`DataFrame` object that holds the cached result in a temporary table.
         All operations on this new DataFrame have no effect on the original.
    """
    temp_table_name = random_name_for_temp_object(TempObjectType.TABLE)
    create_temp_table = self._session._plan_builder.create_temp_table(
        temp_table_name, self._plan
    )
    self._session._conn.execute(
        create_temp_table,
        _statement_params=create_or_update_statement_params_with_query_tag(
            statement_params, self._session.query_tag, SKIP_LEVELS_TWO
        ),
    )
    new_plan = self._session.table(temp_table_name)._plan
    return DataFrame(session=self._session, plan=new_plan, is_cached=True)

and

    def create_temp_table(self, name: str, child: SnowflakePlan) -> SnowflakePlan:
        return self.build_from_multiple_queries(
            lambda x: self.create_table_and_insert(
                self.session, name, child.schema_query, x
            ),
            child,
            None,
            child.schema_query,
            is_ddl_on_temp_object=True,
        )

So I would say there is no difference between calling df.cache_result() and creating a temporary table on your own.
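
For completeness, the equivalence is easy to confirm yourself by comparing the execution plans of the two approaches from the question; both reduce to a plain SELECT from a temporary table. A sketch, reusing the foo and my_tbl names from the question:

# Approach 2: cache_result -> plan is SELECT * FROM (SNOWPARK_TEMP_TABLE_...)
session.sql('select * from foo').cache_result().explain()

# Approach 1: manual temp table -> plan is SELECT * FROM MY_TBL
session.sql('select * from foo').write.mode('overwrite').save_as_table('my_tbl', table_type='temporary')
session.table('my_tbl').explain()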

Raphael Roth
  • Very clever pointing that out with explain. That said, I did find one fault with using `cache_result`, which is that it doesn't overwrite the temporary table. I'll explain more in my comments. – Maxim Aug 31 '22 at 20:13