
What are the different ways to join data in Spark?

Hadoop MapReduce provides distributed cache, map-side join, and reduce-side join. What does Spark offer?

Also, it would be great if you could provide simple Scala and Python code to join data sets in Spark.

Durga Viswanath Gadiraju
  • [How do you perform basic joins of two RDD tables in Spark using Python?](http://stackoverflow.com/q/31257077/1560062) – zero323 Dec 23 '15 at 06:24

1 Answer


Spark has two fundamental distributed data abstractions: DataFrames and RDDs.

Pair RDDs, a special case of RDDs in which each element is a key-value pair, can be joined on their keys. This is available via `PairRDDFunctions.join()`. See: https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
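Since I can't assume a Spark runtime here, the snippet below sketches in plain Python exactly what `PairRDDFunctions.join` computes on two pair RDDs: an inner join on keys, producing `(key, (leftValue, rightValue))` tuples. The equivalent PySpark calls are shown in comments, with illustrative data.

```python
# Plain-Python sketch of the semantics of PairRDDFunctions.join.
# Equivalent PySpark (illustrative):
#   left  = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
#   right = sc.parallelize([("a", "x"), ("c", "y")])
#   left.join(right).collect()

def pair_join(left, right):
    """Inner join of two sequences of (key, value) pairs on their keys."""
    from collections import defaultdict
    # Index the right side by key, keeping all values per key.
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    # For each left pair, emit one output per matching right value.
    return [(k, (v, w)) for k, v in left if k in index for w in index[k]]
```

Keys present on only one side are dropped, matching inner-join semantics; Spark also offers `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin` variants on pair RDDs.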

DataFrames also support SQL-like joins via `DataFrame.join()`. See: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
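To make the DataFrame side concrete without a Spark runtime, here is a plain-Python sketch (rows as dicts) of what an inner and a left-outer join on a shared column compute; the equivalent PySpark calls are shown in comments, and the data and names are illustrative.

```python
# Plain-Python sketch of DataFrame.join semantics over rows-as-dicts.
# Equivalent PySpark (illustrative; df1/df2 are assumed DataFrames):
#   df1.join(df2, on="id", how="inner")
#   df1.join(df2, on="id", how="left_outer")

def df_join(rows1, rows2, key, how="inner"):
    """Join two lists of row dicts on a shared column name."""
    # Columns only the right side contributes; filled with None
    # for unmatched left rows in a left outer join.
    right_cols = set()
    for r in rows2:
        right_cols |= set(r) - {key}
    # Build a hash index on the right side's join key.
    index = {}
    for r in rows2:
        index.setdefault(r[key], []).append(r)
    out = []
    for r in rows1:
        matches = index.get(r[key], [])
        for m in matches:
            out.append({**r, **m})
        if not matches and how == "left_outer":
            out.append({**r, **{c: None for c in right_cols}})
    return out
```

The same joins can be expressed in Spark SQL directly, e.g. `spark.sql("SELECT ... FROM t1 JOIN t2 ON t1.id = t2.id")` after registering the DataFrames as temporary tables.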

David Maust