I have the following DataFrame:
+------+------------------+--------------+-------------+
| name| email| phone| country|
+------+------------------+--------------+-------------+
| Mike| mike@example.com|+91-9999999999| Italy|
| Alex| alex@example.com|+91-9999999998| France|
| John| john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
| Dan| dan@example.com|+91-9999444999| Poland|
| Scott| scott@example.com|+91-9111999998| Spain|
| Rob| rob@example.com|+91-9114444998| Italy|
+------+------------------+--------------+-------------+
after applying the following transformation:
val tags = Map(
"big" -> "country IN (FROM big_countries)",
"medium" -> "country IN (FROM medium_countries)",
// a few thousands of other tag keys and conditions with any possible SQL statements allowed in SQL WHERE clause(users create them on the application UI)
"sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")
def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
f"FROM $table WHERE $tagCondition"
}
val userTags = tags.map {
case (tag, tagCondition) => {
spark.sql(buildTagQuery(tag, tagCondition, "users"))
.withColumn("tag", lit(tag).cast(StringType))
}
}
val unionDf = userTags.foldLeft(userTags.head) {
case (acc, df) => acc.union(df)
}
I receive the following DataFrame:
+------+------------------+--------------+-------------+-------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+-------+
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Dan| dan@example.com|+91-9999444999| Poland| medium|
| Scott| scott@example.com|+91-9111999998| Spain| medium|
|Donald|donald@example.com| +1-2222222222|United States|sometag|
+------+------------------+--------------+-------------+-------+
which duplicates each original DataFrame record with additional information in tag column but I need something like this(not duplicated records from the original DataFrame and collection of tags in the tag
column):
+------+------------------+--------------+-------------+--------------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+--------------+
| Mike| mike@example.com|+91-9999999999| Italy| [big]|
| Alex| alex@example.com|+91-9999999998| France| [big]|
| John| john@example.com| +1-1111111111|United States| [big]|
|Donald|donald@example.com| +1-2222222222|United States| [big,sometag]|
| Dan| dan@example.com|+91-9999444999| Poland| [medium]|
| Scott| scott@example.com|+91-9111999998| Spain| [big,medium]|
| Rob| rob@example.com|+91-9114444998| Italy| [big]|
+------+------------------+--------------+-------------+--------------+
Right now I don't know how to change my transformation in order to receive such a structure with the tag
column like ArrayType
without original row duplication.