
I am trying to get a filtered list of auctions that occurred around the time of specific winning auctions, using Spark. Both the winning-auction RDD and the full-auction RDD are made up of case classes with this format:
case class auction(id: String, prodID: String, timestamp: Long)

I would like to filter the full auctions RDD down to auctions that occurred within 10 seconds of a winning auction, on the same product ID, and get back an RDD of those auctions.

I have attempted to filter it like this:

val specificmessages = winningauction.map(it =>
  allauctions.filter( x =>
    x.timestamp > it.timestamp - 10 &&
    x.timestamp < it.timestamp + 10 &&
    x.prodID == it.prodID
  )
)

Is there a way to perform this, given that nested RDD transformations are not possible?

There is another answer, but it mainly deals with nested map functions: SPARK-5603 nested map functions

  • Try looking at the `cartesian` method (https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) to build a new RDD and apply your filter to it – ccheneson Jul 08 '15 at 14:20
  • Untested: val specificmessages = allauctions.cartesian(winningauction).filter { case (x, y) => x.timestamp > y.timestamp - 10 && x.timestamp < y.timestamp + 10 && x.prodID == y.prodID } – ccheneson Jul 08 '15 at 14:25
  • yep, thanks for reminding me about cartesian function. It works an absolute treat. Can you add it as an answer? – eboni Jul 08 '15 at 14:46

1 Answer


Try looking at the cartesian method to build a new RDD and apply your filter to it

val specificmessages = allauctions.cartesian(winningauction)
                                  .filter { case (x, y) =>
                                    x.timestamp > y.timestamp - 10 &&
                                    x.timestamp < y.timestamp + 10 &&
                                    x.prodID == y.prodID
                                  }
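To see the idea without a Spark cluster, here is a minimal local sketch of the cartesian-plus-filter approach using plain Scala collections in place of RDDs (the sample auctions and timestamps are made up for illustration; a for-comprehension over two sequences plays the role of `RDD.cartesian`):

```scala
// Local stand-in for the two RDDs.
case class Auction(id: String, prodID: String, timestamp: Long)

val allAuctions = Seq(
  Auction("a1", "p1", 100L),
  Auction("a2", "p1", 105L),
  Auction("a3", "p2", 107L), // same time window, different product
  Auction("a4", "p1", 200L)  // same product, outside the window
)
val winningAuctions = Seq(Auction("w1", "p1", 103L))

// Cross product of the two collections, keeping only pairs where the
// auction is within 10 time units of the winner on the same product --
// the same predicate used in the RDD version above.
val specificMessages =
  for {
    x <- allAuctions
    y <- winningAuctions
    if x.timestamp > y.timestamp - 10 &&
       x.timestamp < y.timestamp + 10 &&
       x.prodID == y.prodID
  } yield (x, y)

specificMessages.map(_._1.id) // -> Seq("a1", "a2")
```

Note that `cartesian` produces every pair of elements from the two RDDs before filtering, so on large datasets it can be expensive; it is simplest when one side (the winning auctions) is small.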