I have the following DataFrame:

+------+------------------+--------------+-------------+
|  name|             email|         phone|      country|
+------+------------------+--------------+-------------+
|  Mike|  mike@example.com|+91-9999999999|        Italy|
|  Alex|  alex@example.com|+91-9999999998|       France|
|  John|  john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
|   Dan|   dan@example.com|+91-9999444999|       Poland|
| Scott| scott@example.com|+91-9111999998|        Spain|
|   Rob|   rob@example.com|+91-9114444998|        Italy|
+------+------------------+--------------+-------------+

after applying the following transformation:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions, with any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
    f"FROM $table WHERE $tagCondition"
}

val userTags = tags.map {
  case (tag, tagCondition) => {
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
  }
}

val unionDf = userTags.foldLeft(userTags.head) {
  case (acc, df) => acc.union(df)
}

I receive the following DataFrame:

+------+------------------+--------------+-------------+-------+
|  name|             email|         phone|      country|    tag|
+------+------------------+--------------+-------------+-------+
|  Mike|  mike@example.com|+91-9999999999|        Italy|    big|
|  Alex|  alex@example.com|+91-9999999998|       France|    big|
|  John|  john@example.com| +1-1111111111|United States|    big|
|Donald|donald@example.com| +1-2222222222|United States|    big|
| Scott| scott@example.com|+91-9111999998|        Spain|    big|
|   Rob|   rob@example.com|+91-9114444998|        Italy|    big|
|  Mike|  mike@example.com|+91-9999999999|        Italy|    big|
|  Alex|  alex@example.com|+91-9999999998|       France|    big|
|  John|  john@example.com| +1-1111111111|United States|    big|
|Donald|donald@example.com| +1-2222222222|United States|    big|
| Scott| scott@example.com|+91-9111999998|        Spain|    big|
|   Rob|   rob@example.com|+91-9114444998|        Italy|    big|
|   Dan|   dan@example.com|+91-9999444999|       Poland| medium|
| Scott| scott@example.com|+91-9111999998|        Spain| medium|
|Donald|donald@example.com| +1-2222222222|United States|sometag|
+------+------------------+--------------+-------------+-------+

which duplicates each original DataFrame record, adding the extra information in the tag column. What I need instead is no duplicated records from the original DataFrame and a collection of tags in the tag column, something like this:

+------+------------------+--------------+-------------+--------------+
|  name|             email|         phone|      country|           tag|
+------+------------------+--------------+-------------+--------------+
|  Mike|  mike@example.com|+91-9999999999|        Italy|         [big]|
|  Alex|  alex@example.com|+91-9999999998|       France|         [big]|
|  John|  john@example.com| +1-1111111111|United States|         [big]|
|Donald|donald@example.com| +1-2222222222|United States| [big,sometag]|
|   Dan|   dan@example.com|+91-9999444999|       Poland|      [medium]|
| Scott| scott@example.com|+91-9111999998|        Spain|  [big,medium]|
|   Rob|   rob@example.com|+91-9114444998|        Italy|         [big]|
+------+------------------+--------------+-------------+--------------+

Right now I don't know how to change my transformation in order to get such a structure, with the tag column as an ArrayType and without duplicating the original rows.

alexanoid
  • Can you please post the code `buildTagQuery(tag, tagCondition, "tagged_users")`, maybe we can create a `UDF` that produces the `tag` column, which would let us avoid the union, map and foldLeft on dataframes. Or add all the columns in the `sql` query so that the union can be avoided. – Pavithran Ramachandran Nov 09 '18 at 21:07
  • @PavithranRamachandran please see the updated question. I have added the info. – alexanoid Nov 10 '18 at 06:47

1 Answer


Here is one possible method that does not change too much of your logic.

First you will have to assign a unique id to the users table, as shown below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Load the users table and attach a unique id to every row.
val userstable = spark.sql("select * from users")

val userswithId = userstable.withColumn("UniqueID", monotonically_increasing_id())

// Re-register the view so the tag queries below also see the UniqueID column.
userswithId.createOrReplaceTempView("users")
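
(monotonically_increasing_id generates ids that are guaranteed unique and monotonically increasing, but not necessarily consecutive; that is fine here because the id is only used as a join key.)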

Your tags map and the buildTagQuery function remain the same as above.

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions, with any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
  f"FROM $table WHERE $tagCondition"
}

In the userTags step we now select only the UniqueID and tag columns:

val userTags = tags.map {
  case (tag, tagCondition) => {
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType)).select("UniqueID", "tag")
  }
}

This is very important. There was a subtle bug in your original code: because you folded over the full userTags collection while also using userTags.head as the initial accumulator, the head DataFrame was unioned twice. What I have done here is select the head into a separate variable and then drop it from userTags. The folding logic is the same as before, but now the head element is not folded twice.

val headhere = userTags.head
// Everything except the head, so the head is unioned exactly once.
val userResults = userTags.drop(1)
val unionDf2 = userResults.foldLeft(headhere) {
  case (acc, df) => acc.union(df)
}
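
As a side note, the same union can be written without the separate head/drop bookkeeping by using reduce; a minimal sketch, assuming userTags is a non-empty collection of DataFrames:

// Equivalent sketch: reduce unions every DataFrame exactly once, so the head is never folded twice.
val unionDf2 = userTags.reduce(_ union _)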

Now we group by the UniqueID column while aggregating the tags into a list.

val unionDf3 = unionDf2.groupBy("UniqueID").agg(collect_list("tag"))

println("Printing the unionDf3 result")
unionDf3.show(25)
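
If you want the aggregated column to be named tag, as in your desired output, you can alias the aggregation. A small variant of the groupBy line above (use it instead of that line, not in addition); collect_set would additionally drop duplicate tags per user:

// Variant: same aggregation, but the resulting column is named "tag";
// collect_set("tag") could be used instead of collect_list to de-duplicate tags.
val unionDf3 = unionDf2.groupBy("UniqueID").agg(collect_list("tag").alias("tag"))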

Finally we join the users table that carries the UniqueID we assigned earlier (i.e. userswithId) with the previous dataframe to get the final result.

val finalResult = userswithId.join(unionDf3,"UniqueID")

println("Printing the final result")
finalResult.show(25)
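
Since your desired output has no id column, you can drop the helper UniqueID after the join; a minimal sketch (the name finalWithoutId is just for illustration):

// Sketch: remove the helper UniqueID column so the output matches the original
// schema plus the aggregated tag column.
val finalWithoutId = finalResult.drop("UniqueID")
finalWithoutId.show(25)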

The final result is as below:

+--------+------+------------------+--------------+-------------+-----------------+
|UniqueID|  name|             email|         phone|      country|collect_list(tag)|
+--------+------+------------------+--------------+-------------+-----------------+
|       0|  Alex|  alex@example.com|+91-9999999998|       France|            [big]|
|       1|  John|  john@example.com| +1-1111111111|United States|            [big]|
|       2|Donald|donald@example.com| +1-2222222222|United States|   [big, sometag]|
|       4| Scott| scott@example.com|+91-9111999998|        Spain|    [big, medium]|
+--------+------+------------------+--------------+-------------+-----------------+
user238607
  • Thanks for your answer! Could you please compare, from a performance point of view, your solution with a `groupBy` over all columns, something like this: `val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*).agg(collect_set("tag").alias("tags"))`? – alexanoid Nov 10 '18 at 15:31
  • I think you will have to benchmark it. In my solution there is unique id generation, a groupBy on a smaller table over a single numeric column, and a join to get the final result. In your case, you are grouping by all the columns except the `tag` column and then aggregating the tags into a list. I can't really tell which solution would be faster in practice. – user238607 Nov 10 '18 at 16:00
  • Thanks, I feel a simple join (your solution) will work much faster. Thanks again! – alexanoid Nov 10 '18 at 16:02
  • @alexanoid: In case the dataframe `unionDf3` is small enough, you can use a broadcast join to make it slightly faster. https://stackoverflow.com/questions/32435263/dataframe-join-optimization-broadcast-hash-join – user238607 Nov 10 '18 at 16:05