0

I'm working with a map object in scala where the key is a basket ID and the value is a set of item ID's contained within a basket. The goal is to ingest this map object and compute for each basket, a set of other basket ID's that contain at least one common item.

Say the input map object is

val basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6"))

Is it possible to perform the computation in spark such that I get the intersecting basket information back? For example val intersects = Map("b1" -> Set("b2", "b3"), "b2" -> Set("b1"), "b3" -> Set("b1"), "b4" -> Set())

Thanks!

tyjchen
  • 5
  • 2

1 Answers1

0

Something like...

val basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6"))

def intersectKeys( set : Set[String], map : Map[String,Set[String]] ) : Set[String] = {
  val checks = map.map { case (k, v) =>
    if (set.intersect(v).nonEmpty) Some(k) else None
  }
  checks.collect { case Some(k) => k }.toSet
}

// each set picks up its own key, which we don't want, so we subtract it back out
val intersects = basket.map { case (k,v) => (k, intersectKeys(v, basket) - k) }
Steve Waldman
  • 13,689
  • 1
  • 35
  • 45
  • i tried executing this solution to my entire problem and noticed that my data preprocessing is not very scalable. I was wondering if you had any ideas how i can make it more scalable, just sharing the link here: https://stackoverflow.com/questions/63677703/how-to-efficiently-parse-dataframe-object-into-a-map-of-key-value-pairs – tyjchen Aug 31 '20 at 20:25
  • some very simple things that might speed this up... `set.intersect(v).nonEmpty` is wasteful, it computes the full intersection of both sets where you're just looking for overlap. you might try `set.find( e => v(e) ).nonEmpty`, which would just search for the first joint element. the final step, could be parallelized, which could be as simple as `basket.par.map { ... }.toMap` (i haven't tried these things, just some quick ideas.) – Steve Waldman Sep 01 '20 at 00:23