
I have two large dataframes with around a couple of million records each.

import spark.implicits._  // needed for .toDF outside the spark-shell

val df1 = Seq(
  ("k1a", "k2a", "g1x", "g2x"),
  ("k1b", "k2b", "g1x", "g2x"),
  ("k1c", "k2c", "g1x", "g2y"),
  ("k1d", "k2d", "g1y", "g2y"),
  ("k1e", "k2e", "g1y", "g2y"),
  ("k1f", "k2f", "g1z", "g2y")
).toDF("key1", "key2", "grp1", "grp2")

val df2 = Seq(
  ("k1a", "k2a", "v4a"),
  ("k1b", "k2b", "v4b"),
  ("k1c", "k2c", "v4c"),
  ("k1d", "k2d", "v4d"),
  ("k1e", "k2e", "v4e"),
  ("k1f", "k2f", "v4f")
).toDF("key1", "key2", "fld4")

I am trying to join them and perform a groupBy as below, but it is taking forever to produce a result. There are around one million unique combinations of grp1 + grp2 in df1.

import org.apache.spark.sql.functions.{collect_list, struct}

val df3 = df1.join(df2, Seq("key1", "key2"))
val df4 = df3
  .groupBy("grp1", "grp2")
  .agg(collect_list(struct($"key1", $"key2")).as("dups"))
  .filter("size(dups) > 1")

Is there a way to reduce the shuffling? Is mapPartitions the right approach for these two scenarios? Can anyone suggest an efficient way, with an example?
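
For reference, here is a minimal sketch of one way I have been thinking about the shuffle, assuming a SparkSession named spark and the df1/df2 above; the partition count is a hypothetical value to tune. Repartitioning both sides on the join keys does not remove the join's shuffle, but it makes the partitioning explicit, and caching the join output means later passes over it do not recompute the join:

import org.apache.spark.sql.functions.{collect_list, size, struct}

val numPartitions = 200  // hypothetical; tune to the cluster

// Hash-partition both sides on the join keys so the sort-merge join can
// reuse this partitioning instead of inserting its own exchanges.
val df1p = df1.repartition(numPartitions, $"key1", $"key2")
val df2p = df2.repartition(numPartitions, $"key1", $"key2")

// Cache the join output: later iterations reuse it, so the join and its
// shuffle are paid only once.
val joined = df1p.join(df2p, Seq("key1", "key2")).cache()

// The groupBy clusters by different columns (grp1/grp2), so one more
// shuffle here is unavoidable.
val dups = joined
  .groupBy("grp1", "grp2")
  .agg(collect_list(struct($"key1", $"key2")).as("dups"))
  .filter(size($"dups") > 1)

dups.explain(true)  // inspect the physical plan for Exchange nodes

To avoid the join shuffle entirely, the inputs would have to be co-partitioned on disk, e.g. written with bucketBy on key1/key2, which is the approach discussed in the question linked in the comments below.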

  • Hi, df3 is not used in df4's computation. Is it a mistake? Also, could you tell us what you do with df4 after creating it? Spark being lazy, the code you wrote does not trigger anything. Seeing how you manipulate df4 before saving it (collect or save) may help us tell you why it's taking so long. – Oli Jul 06 '18 at 09:43
  • Hi, I corrected the computation of df4. Basically, this is a simplified scenario. Once I perform the grouping, I a) store the df4 results, b) compute a new df3 by removing the key1/key2 combos found in the grouped df4 from the current df3, and c) compute a new df4 by performing another grouping on the new df3 (by grpy/grpx fields, not mentioned in this example). Steps a-c repeat for two more iterations with different group-by criteria; see the sketch after these comments. – M S Jul 06 '18 at 10:50
  • You're saying that in df1, the tuples (key1, key2) are unique. What about df2, can there be duplicated (key1, key2) tuples? – Oli Jul 06 '18 at 12:38
  • The tuples (key1,key2) are unique in both df1 and df2. – M S Jul 06 '18 at 13:31
  • mapPartitions does not work for DFs – thebluephantom Jul 08 '18 at 08:06
  • See https://stackoverflow.com/questions/43831387/how-to-avoid-shuffles-while-joining-dataframes-on-unique-keys – thebluephantom Jul 08 '18 at 08:19
  • So much for the under-the-hood DF optimization I keep reading about. – thebluephantom Jul 08 '18 at 08:24
  • 1M is low in Big Data terms – thebluephantom Jul 08 '18 at 08:25
  • Can you show explain(true)? – thebluephantom Jul 08 '18 at 08:49
  • What is the size of the table sources involved? – thebluephantom Jul 08 '18 at 09:19
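
Picking up the iterative workflow from M S's comment above, here is a hedged sketch of steps a-c. It assumes the cached joined dataframe from the earlier snippet; the later-pass group columns (grpy/grpx) and the output path are hypothetical, since they are not in the sample data:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_list, explode, size, struct}

// Group by the given pair of columns and keep only groups with duplicates.
def findDups(df: DataFrame, g1: String, g2: String): DataFrame =
  df.groupBy(g1, g2)
    .agg(collect_list(struct($"key1", $"key2")).as("dups"))
    .filter(size($"dups") > 1)

var current = joined
for ((g1, g2) <- Seq(("grp1", "grp2") /* , ("grpy", "grpx"), ... */)) {
  val dups = findDups(current, g1, g2).cache()
  dups.write.parquet(s"/tmp/dups_${g1}_$g2")  // a) store this pass's results (hypothetical path)

  // b) drop the keys found above; a left_anti join keeps only the rows of
  //    `current` with no match in the duplicate key set
  val dupKeys = dups
    .select(explode($"dups").as("k"))
    .select($"k.key1", $"k.key2")
  current = current.join(dupKeys, Seq("key1", "key2"), "left_anti").cache()
  // c) the next loop iteration regroups `current` on the next criteria
}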

0 Answers