I'm currently using PySpark on a Databricks interactive cluster (submitting jobs through Databricks Connect), with Snowflake as both the input and the output data store.
My Spark application is supposed to read data from Snowflake, apply some simple SQL transformations (mainly F.when(...).otherwise(...), i.e. narrow transformations), then load the result back into Snowflake. (FYI, schemas are passed to the Snowflake reader and writer.)
EDIT: There's also a sort transformation at the end of the process, before writing.
For testing purposes, I set a job description like this (both the Writer and the Reader steps are supposed to be named):
sc.setJobDescription("Step Snowflake Reader")
I have trouble understanding what the Spark UI is showing me:
So I get 3 jobs, all with the same job name (Writer). I understand that I have only one Spark action, so I would expect a single job, and I assume Spark labeled the jobs with the last value set by sc.setJobDescription (the Writer's, since the write is what triggers the Spark computation).
I also tagged my "ReaderClass":
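To make that assumption concrete, here is a toy model in plain Python of what I think is happening. It is not Spark's implementation: ToyContext, LazyFrame and the jobs_needed parameter are hypothetical names. The only idea it encodes is that transformations are lazy and that a job gets whatever description is current at the moment it is submitted.

```python
class ToyContext:
    """Toy stand-in for a Spark context; not Spark's real implementation."""

    def __init__(self):
        self.description = None
        self.jobs = []  # description recorded for each job that actually runs

    def set_job_description(self, text):
        self.description = text

    def run_job(self):
        # Assumption: a job is labeled with the description current at submission.
        self.jobs.append(self.description)


class LazyFrame:
    """Toy lazy plan: transformations are recorded, nothing runs until write()."""

    def __init__(self, ctx, steps=()):
        self.ctx = ctx
        self.steps = tuple(steps)

    def transform(self, name):
        # Only extends the plan; no job is submitted here.
        return LazyFrame(self.ctx, self.steps + (name,))

    def write(self, jobs_needed=1):
        # The single action; it may need several jobs to complete.
        for _ in range(jobs_needed):
            self.ctx.run_job()


ctx = ToyContext()
ctx.set_job_description("Step Snowflake Reader")
frame = LazyFrame(ctx).transform("read")
frame = frame.transform("when_otherwise").transform("sort")
ctx.set_job_description("Step Snowflake Writer")
frame.write(jobs_needed=3)  # 3 mirrors the job count I see in the UI
print(ctx.jobs)
```

In this model the "Reader" description is overwritten before anything runs, so all three jobs end up labeled "Step Snowflake Writer".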
sc = spark.sparkContext
sc.setJobDescription("Step Snowflake Reader")
Why doesn't it show?
Is the first job something like "Downloading data from Snowflake", the second "Apply SQL transformations", and the third "Upload data to Snowflake"?
Why are all my jobs related to the same SQL query? And what is Query 0, which is related to ... zero jobs?