
I have a large TSV data file that contains, lumped together, a fact table and its dimension tables. I'm wondering if it's possible, through Spark, to divide/partition that single file into separate 'tables', and then perform a join to normalize them?

Any help pointing me in the right direction would be awesome.

JeffLL

1 Answer


Apply a filter on the baseRDD to get both a factRDD and a dimensionsRDD, then you can join them.

val baseRDD = sc.textFile("...")             // one TSV line per record
val factRDD = baseRDD.filter(func1)          // func1/func2: your row predicates
val dimensionsRDD = baseRDD.filter(func2)
// join is only defined on key-value pair RDDs, so key each side first
// (extractKey here stands for whatever pulls the join key out of a line):
factRDD.keyBy(extractKey).join(dimensionsRDD.keyBy(extractKey))
Rajkumar
  • thank you, but would it be possible to do it without parsing the RDD twice? I know I'm nitpicking, just curious. – JeffLL Feb 20 '15 at 19:26
  • @AlbertLim - That's a valid concern. This is similar to the problem [described here](http://stackoverflow.com/q/23995040/877069). Having to scan over the same data more than once is definitely suboptimal. – Nick Chammas Feb 20 '15 at 22:04
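On the double-scan concern raised in the comments: in Spark the usual mitigation is to call `baseRDD.cache()` before the two `filter` calls, so the text file is materialized once and both passes read from memory. The alternative is to classify each row in a single pass. The sketch below illustrates that single-pass idea with plain Scala collections (no SparkContext needed); the `"F"`/`"D"` tag in the first TSV column and the `isFact` helper are assumptions for illustration, not from the original post.

```scala
// Hypothetical single-pass split: assume each TSV row carries a tag in
// its first column ("F" = fact row, "D" = dimension row).
object SplitTsv {
  def isFact(line: String): Boolean =
    line.split("\t", 2).head == "F"

  // partition walks the collection exactly once and emits both groups,
  // analogous in effect to caching baseRDD before running two filters.
  def split(lines: Seq[String]): (Seq[String], Seq[String]) =
    lines.partition(isFact)
}
```

The same shape carries over to an RDD via `baseRDD.map` into tagged pairs, but for most workloads `cache()` plus two filters is simpler and fast enough.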