
I have a large TSV data file that contains, lumped together, a fact table and its dimension tables. I'm wondering if it's possible, through Spark, to divide/partition that single file into separate 'tables', and then perform a join to normalize them?

Any help pointing me in the right direction would be awesome.

JeffLL

1 Answer


Apply a filter on the baseRDD to get both a factRDD and a dimensionsRDD, then you can join them.

val baseRDD = sc.textFile("...")             // one TSV line per record
val factRDD = baseRDD.filter(func1)          // func1/func2: your row predicates
val dimensionsRDD = baseRDD.filter(func2)
// join is only defined on key-value pair RDDs, so key each side first
// (extractKey here stands for whatever pulls the join key out of a line):
factRDD.keyBy(extractKey).join(dimensionsRDD.keyBy(extractKey))
Rajkumar
  • thank you, but would it be possible to do it without parsing the RDD twice? I know I'm nitpicking, just curious. – JeffLL Feb 20 '15 at 19:26
  • @AlbertLim - That's a valid concern. This is similar to the problem [described here](http://stackoverflow.com/q/23995040/877069). Having to scan over the same data more than once is definitely suboptimal. – Nick Chammas Feb 20 '15 at 22:04
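On the double-scan concern raised in the comments: in Spark the usual mitigation is to call `baseRDD.cache()` before the two `filter` calls, so the text file is materialized once and both passes read from memory. The alternative is to classify each row in a single pass. The sketch below illustrates that single-pass idea with plain Scala collections (no SparkContext needed); the `"F"`/`"D"` tag in the first TSV column and the `isFact` helper are assumptions for illustration, not from the original post.

```scala
// Hypothetical single-pass split: assume each TSV row carries a tag in
// its first column ("F" = fact row, "D" = dimension row).
object SplitTsv {
  def isFact(line: String): Boolean =
    line.split("\t", 2).head == "F"

  // partition walks the collection exactly once and emits both groups,
  // analogous in effect to caching baseRDD before running two filters.
  def split(lines: Seq[String]): (Seq[String], Seq[String]) =
    lines.partition(isFact)
}
```

The same shape carries over to an RDD via `baseRDD.map` into tagged pairs, but for most workloads `cache()` plus two filters is simpler and fast enough.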