What are different ways to join data in Spark?
Hadoop MapReduce provides the distributed cache, map-side joins, and reduce-side joins. What are the equivalents in Spark?
It would also be great if you could provide simple Scala and Python code to join datasets in Spark.
Spark has two fundamental distributed data abstractions: RDDs and DataFrames, and each supports joins.
RDDs of key-value pairs (pair RDDs) can be joined on their keys using PairRDDFunctions.join(). See: https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
DataFrames also support SQL-like joins via DataFrame.join(). See: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame