What are different ways to join data in Spark?
Hadoop MapReduce provides the distributed cache, map-side joins, and reduce-side joins. What are the equivalents in Spark?
It would also be great if you could provide simple Scala and Python code to join datasets in Spark.
Spark has two fundamental distributed data abstractions: RDDs and DataFrames, and each supports joins.
RDDs of key-value pairs (pair RDDs) can be joined on their keys using PairRDDFunctions.join(). See: https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
DataFrames also support SQL-like joins via DataFrame.join(). See: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame