I need to allow users to define named collections that they can later use when constructing Spark DataFrame SQL queries.
I planned to use Spark broadcast variables for this purpose, but based on the SO question "How to refer broadcast variable in Spark DataFrameSQL", it looks like that is impossible.
Let's say as a user I have created the following collection through the application UI:
name: countries_dict
values: Seq("Italy", "France", "United States", "Poland", "Spain")
On another page of the application UI, as the user, I have created the following Spark SQL query:
SELECT name, phone, country FROM users
and I'd like to filter the records with something like SELECT name, phone, country FROM users WHERE country IN countries_dict
So, for example, right now I can create something similar in the following way:
val countriesDict = Seq("Italy", "France", "United States", "Poland", "Spain")
val inDict = (s: String) => {
  countriesDict.contains(s)
}
spark.udf.register("in_dict", inDict)
and then:
SELECT name, phone, country FROM users WHERE in_dict(country)
but the biggest issue with this approach is that countriesDict is hardcoded and not created dynamically from the user's input in the UI.
Is it possible to extend this approach somehow to support collections (with names and elements) that users create dynamically via the application UI?
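To make the question concrete, here is a minimal sketch of the kind of extension I have in mind: each time a user saves a collection, the application registers one UDF per collection at runtime. The `NamedCollections` object, its `register` method, and the `in_<collection name>` naming scheme are hypothetical names used only for illustration; the only Spark calls assumed are `SparkSession` and `spark.udf.register`.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not an existing Spark API.
object NamedCollections {

  // Registers a UDF called "in_<collectionName>" that checks membership
  // in `values`, and returns the UDF name so the caller can build SQL with it.
  def register(spark: SparkSession, collectionName: String, values: Seq[String]): String = {
    // The name ends up inside SQL text, so accept identifier-like names only.
    require(
      collectionName.matches("[A-Za-z_][A-Za-z0-9_]*"),
      s"Invalid collection name: $collectionName"
    )

    // Immutable copy captured by the closure; it is serialized to the
    // executors together with the UDF.
    val lookup: Set[String] = values.toSet
    val udfName = s"in_$collectionName"

    // Null column values are treated as "not in the collection".
    spark.udf.register(udfName, (s: String) => s != null && lookup.contains(s))
    udfName
  }
}
```

With this in place, the application would call `NamedCollections.register(spark, "countries_dict", Seq("Italy", "France", "United States", "Poland", "Spain"))` when the user saves the collection, and the query on the other page could then be written as SELECT name, phone, country FROM users WHERE in_countries_dict(country). Registering the same name again replaces the earlier UDF in that session, so edits to a collection would require re-registering it. I am not sure whether this is a reasonable design, in particular for large collections.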