Removing columns with all 0s using SQL/ Snowpark (working in SageMaker)

Question

Right now, I'm just converting the df from a Snowpark df to a Pandas one and then removing columns with all 0s:

df = df.to_pandas()
df.loc[:, (df != 0).any(axis=0)]

Any way to do this without first converting to pandas?

Lukasz Szozda · Accepted Answer · 2023-04-11T19:18:47.393

It is possible to achieve this effect in Snowpark:

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, sum, sql_expr

def main(session: snowpark.Session): 
    df = session.create_dataframe([[0, 1, 2, 0], [0, 5, 0, 0], [0, 9, 9, 0], [0, 7, 1, 0]], \
                                  schema=["q", "x", "y", "z"])
    return df

/*
Q   X   Y   Z
0   1   2   0
0   5   0   0
0   9   9   0
0   7   1   0
*/

Using conditional aggregation to build column list that is later dropped:

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, sum, sql_expr

def main(session: snowpark.Session): 
    df = session.create_dataframe([[0, 1, 2, 0], [0, 5, 0, 0], [0, 9, 9, 0], [0, 7, 1, 0]], \
                                  schema=["q", "x", "y", "z"])
    
    dropcols = []
    for colname in df.columns:
        if df.agg(sum(sql_expr("CASE WHEN {colname}=0 THEN 0 ELSE 1 END". \
                               format(colname=colname)))).collect()[0][0]==0:
            dropcols.append(colname)
            
    return df.drop(dropcols)

/*
X   Y
1   2
5   0
9   9
7   1
*/

Output:

Compact version:

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, sum, sql_expr

def main(session: snowpark.Session): 
    df = session.create_dataframe([[0, 1, 2, 0], [0, 5, 0, 0], [0, 9, 9, 0], [0, 7, 1, 0]], \
                                  schema=["q", "x", "y", "z"])
    
    dropcols = [c for c in df.columns if df.agg(sum(sql_expr("CASE WHEN {c}=0 THEN 0 ELSE 1 END". \
                                                             format(c=c)))).collect()[0][0]==0 ]
    return df.drop(dropcols)

Thank you. Please could you explain this part of the code a little: if df.agg(sum(sql_expr("CASE WHEN {colname}=0 THEN 0 ELSE 1 END". \ format(colname=colname)))).collect()[0][0]==0: — YoungboyVBA, Apr 11 '23 at 22:47
@YoungboyVBA It perfroms an conditional sum (0 counted as 0/!=0 as 1) per each column. collected (single row/column), value is accessed [0][0] and compared with 0. If all values are 0 then it is true, therefore the column to be added for removal — Lukasz Szozda, Apr 12 '23 at 13:57

Removing columns with all 0s using SQL/ Snowpark (working in SageMaker)

1 Answers1

Linked