
We plan to move Apache Pig code to the new Spark platform.

Pig has a "Bag/Tuple/Field" concept and behaves similarly to a relational database. Pig provides support for CROSS/INNER/OUTER joins.

For a CROSS join, Pig offers the syntax: alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];

But as we move to the Spark platform, I couldn't find any counterpart in the Spark API. Do you have any ideas?

Misha Brukman
Shawn Guo
  • It's not ready yet, but Spork (Pig on Spark) is currently being built, so you may not have to change any of your code – aaronman Jul 21 '14 at 14:58

2 Answers


It is oneRDD.cartesian(anotherRDD).
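To illustrate what cartesian does, here is a minimal sketch of cartesian-product semantics using plain Scala collections; RDD.cartesian(anotherRDD) pairs every element of one RDD with every element of the other in the same way, just distributed. The helper name `cross` is hypothetical, not part of the Spark API.

```scala
// A minimal sketch of cartesian-product semantics using plain Scala
// collections. oneRDD.cartesian(anotherRDD) pairs every element of one
// RDD with every element of the other in exactly the same way.
object CartesianSketch {
  // `cross` is a hypothetical helper name, not part of the Spark API.
  def cross[A, B](xs: Seq[A], ys: Seq[B]): Seq[(A, B)] =
    for (x <- xs; y <- ys) yield (x, y)
}

// 3 elements x 2 elements => 6 pairs, mirroring oneRDD.cartesian(anotherRDD)
val pairs = CartesianSketch.cross(Seq(1, 2, 3), Seq("a", "b"))
// pairs.size == 6; pairs.head == (1, "a")
```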

Daniel Darabos

Here is the recommended version for Spark 2.x Datasets and DataFrames:

scala> val ds1 = spark.range(10)
ds1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> ds1.cache.count
res1: Long = 10

scala> val ds2 = spark.range(10)
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> ds2.cache.count
res2: Long = 10

scala> val crossDS1DS2 = ds1.crossJoin(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]

scala> crossDS1DS2.count
res3: Long = 100

Alternatively, it is possible to use the traditional join syntax with no join condition, but you must first set this configuration option to avoid the error that follows.

spark.conf.set("spark.sql.crossJoin.enabled", true)

Error when that configuration is omitted (using the "join" syntax specifically):

scala> val crossDS1DS2 = ds1.join(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]

scala> crossDS1DS2.count
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
...
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;

Related: spark.sql.crossJoin.enabled for Spark 2.x

Garren S
  • When you join a Dataset with a Dataset, your result is a DataFrame, but I would expect it to be another Dataset instead... why not use joinWith instead? – Dan Ciborowski - MSFT Apr 27 '18 at 14:42
  • Good eye Dan! The example was meant to be purely illustrative of cross join semantics, so using joinWith to get a Dataset back wasn't top of mind. I'll update the answer, but your question opened another line of inquiry around crossJoin method returning DF not DS, leaving users to use joinWith and the configuration option if they wish to maintain their DS, hmm – Garren S Apr 27 '18 at 19:13
  • It seems to me that to use joinWith and do a cross join, you have to use two contradictory conditions whose union covers your whole dataset; I guess this is to really make sure you want to do a cross join – Dan Ciborowski - MSFT May 01 '18 at 05:56
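Following up on the joinWith discussion above, here is a sketch of how a typed cross join might look. It assumes a spark-shell session where ds1 and ds2 are the Datasets defined earlier and spark.sql.crossJoin.enabled has been set; using lit(true) as a trivially-true condition is my assumption of the simplest way to express a cross join through joinWith, not something stated in the answer.

```scala
// Sketch only: assumes a spark-shell session where ds1 and ds2 are the
// Datasets defined earlier and spark.sql.crossJoin.enabled is true.
import org.apache.spark.sql.functions.lit

// joinWith keeps both sides typed, so the result is a Dataset of pairs
// rather than a DataFrame with two ambiguous `id` columns.
val crossTyped = ds1.joinWith(ds2, lit(true))
// crossTyped: org.apache.spark.sql.Dataset[(Long, Long)]
// For the two 10-row inputs above, crossTyped.count should be 100.
```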