A cartesian product is a join without any join key: it pairs every row of one dataframe with every row of the other. Say you have two dataframes A and B with nA and nB rows respectively; you'll end up with a dataframe with nA x nB rows.
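For illustration, here is a minimal sketch of what that looks like with crossJoin, Spark's explicit API for a cartesian product (spark is assumed to be an existing SparkSession; A and B are the same small dataframes built below):

A = spark.createDataFrame([[-69, 40, "trp"], [-69, 41, "nxt"]], ["Lat", "lon", "zip"])
B = spark.createDataFrame([[-68, 43, "trp"], [-89, 45, "trp"]], ["Lat", "lon", "zip"])

# every row of A is paired with every row of B: 2 x 2 = 4 rows
A.crossJoin(B).count()  # 4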
What you are looking for is a regular inner join with join key zip:
A = spark.createDataFrame([[-69,40,"trp"],[-69,41,"nxt"]], ["Lat","lon","zip"])
B = spark.createDataFrame([[-68,43,"trp"],[-89,45,"trp"]], ["Lat","lon","zip"])
A.join(B, "zip").show()
+---+---+---+---+---+
|zip|Lat|lon|Lat|lon|
+---+---+---+---+---+
|trp|-69| 40|-68| 43|
|trp|-69| 40|-89| 45|
+---+---+---+---+---+
Note: be careful with column name disambiguation; you can, for instance, put Lat and lon in a struct for each dataframe before joining them:
import pyspark.sql.functions as psf
A = A.select("zip", psf.struct("Lat", "lon").alias("A"))
B = B.select("zip", psf.struct("Lat", "lon").alias("B"))
df = A.join(B, "zip")
df.show()
df.printSchema()
+---+--------+--------+
|zip| A| B|
+---+--------+--------+
|trp|[-69,40]|[-68,43]|
|trp|[-69,40]|[-89,45]|
+---+--------+--------+
root
|-- zip: string (nullable = true)
|-- A: struct (nullable = false)
| |-- Lat: long (nullable = true)
| |-- lon: long (nullable = true)
|-- B: struct (nullable = false)
| |-- Lat: long (nullable = true)
| |-- lon: long (nullable = true)
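If you need flat columns again after the join, you can pull the struct fields out under unambiguous names, for example (a small usage sketch, the aliases Lat_A, lon_A, Lat_B, lon_B are just illustrative):

df.select(
    "zip",
    psf.col("A.Lat").alias("Lat_A"), psf.col("A.lon").alias("lon_A"),
    psf.col("B.Lat").alias("Lat_B"), psf.col("B.lon").alias("lon_B")
).show()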