
Oversimplified Scenario: A process generates monthly data as an S3 file, and the number of fields can differ in each monthly run. Based on this data in S3, we load it into a table and manually run a SQL query for a few metrics (manually, because a few columns may be added or removed in each run). There are more calculations/transforms on this data, but as a starter I'm presenting a simpler version of the use case.

Approach: Given the schema-less nature of the data, where the number of fields in the S3 file can differ in each run as a few fields are added or removed, the SQL currently requires manual changes every time. I'm planning to explore Spark/Scala so that we can read directly from S3 and generate the SQL dynamically based on the fields that are present.

Query: How can I achieve this with Scala/Spark SQL/DataFrames? The S3 file contains only the required fields from each run, so there is no issue reading the dynamic fields from S3; the DataFrame takes care of that. The issue is how to generate the Spark SQL/DataFrame API code to handle them.

I can read the S3 file into a DataFrame and register it with createOrReplaceTempView to write SQL against it, but I don't think that alone helps: the Spark SQL would still need manual changes whenever a new field is added in S3 in the next run. What is the best way to dynamically generate the SQL, or is there a better way to handle the issue?
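
To make it concrete, something along the lines of the Scala sketch below is roughly what I'm imagining, i.e. building the aggregation list from whatever columns the DataFrame actually has. It is untested; the S3 path, the CSV format/header assumption, and the "month" column-name filter are all placeholders.

--Scala sketch (untested)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("monthly-metrics").getOrCreate()

// read whatever columns this month's file happens to contain
// (assuming a CSV with a header row; adjust the reader for the real format)
val df = spark.read.option("header", "true").csv("s3://my-bucket/monthly/")
df.createOrReplaceTempView("monthly_data")

// build the SELECT list from the columns that are actually present
val metricCols = df.columns.filter(_.contains("month"))
val selectList = metricCols.map(c => s"sum($c)").mkString(", ")

spark.sql(s"SELECT customer, $selectList FROM monthly_data GROUP BY customer").show()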

Usecase-1:

  • First-run

dataframe: customer, month_1_count (here the DataFrame points directly to S3, which contains only the required attributes)

--Sample SQL
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer

--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()

  • Second-Run - One additional column was added

dataframe: customer, month_1_count, month_2_count (here the DataFrame points directly to S3, which contains only the required attributes)

--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer

--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show() 

I'm new to Spark/Scala; it would be helpful if you can provide some direction so that I can explore further.

1 Answer

It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:

from pyspark.sql import functions

# search for column names you want to sum; I filter on "month"
column_search = lambda col_names: 'month' in col_names

# get the column names of a temp dataframe with only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns

# create a dictionary of {column name: "sum"} to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}

# apply the agg function with your groupBy, passing in the columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)

# show the result
grouped_df.show()

Some important concepts that can help you as you learn:

  1. A DataFrame's column names are available as a list: dataframe.columns

  2. Functions can be applied to lists to create new lists as in "column_search"

  3. The agg function accepts multiple expressions as a dictionary of {column name: aggregate function} pairs, which is what I pass in as "columns"

  4. Spark is lazy, so it doesn't change data state or perform operations until you call an action like show(). This means that writing out temporary dataframes just to use one piece of them (such as the column names, as I do above) is not costly, even though it may seem inefficient if you're used to SQL.
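
Since the question asks about Scala, a rough equivalent of the dictionary-based aggregation in the Scala DataFrame API could look like the sketch below (untested; it assumes the DataFrame is named originalDf and that the metric columns all contain "month" in their names):

--Scala sketch (untested)
// pick the columns to aggregate from whatever schema arrived in this run
val relevantColumns = originalDf.columns.filter(_.contains("month"))

// map each column name to the "sum" aggregate, mirroring the Python dictionary above
val aggMap = relevantColumns.map(c => c -> "sum").toMap

// groupBy(...).agg also accepts a Map of column name -> aggregate function
originalDf.groupBy("customer").agg(aggMap).show()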

  • Thanks Kevin for the answer. I'm not sure I followed it completely; let me read more about the details you provided and get back. – Matthew Mar 22 '20 at 16:37