
I am not sure if the long hours are getting to me, but I am seeing some unexpected behavior in Spark 2.2.0.

I have created a toy example as below:

toy_df = spark.createDataFrame([
['p1','a'],
['p1','b'],
['p1','c'],
['p2','a'],
['p2','b'],
['p2','d']],schema=['patient','drug']) 

I create another dataframe:

mdf = toy_df.filter(toy_df.drug == 'c')

As you would expect, mdf is:

 mdf.show()
+-------+----+
|patient|drug|
+-------+----+
|     p1|   c|
+-------+----+ 

Now, if I do this:

toy_df.join(mdf,["patient"],"left").select(toy_df.patient.alias("P1"),toy_df.drug.alias('D1'),mdf.patient,mdf.drug).show()

Surprisingly, I get:

+---+---+-------+----+
| P1| D1|patient|drug|
+---+---+-------+----+
| p2|  a|     p2|   a|
| p2|  b|     p2|   b|
| p2|  d|     p2|   d|
| p1|  a|     p1|   a|
| p1|  b|     p1|   b|
| p1|  c|     p1|   c|
+---+---+-------+----+

But if I use

toy_df.join(mdf,["patient"],"left").show()

I do see the expected behavior:

+-------+----+----+
|patient|drug|drug|
+-------+----+----+
|     p2|   a|null|
|     p2|   b|null|
|     p2|   d|null|
|     p1|   a|   c|
|     p1|   b|   c|
|     p1|   c|   c|
+-------+----+----+

And if I use an alias expression on one of the dataframes, I do get the expected behavior:

toy_df.join(mdf.alias('D'),on=["patient"],how="left").select(toy_df.patient.alias("P1"),toy_df.drug.alias("D1"),'D.drug').show()

+---+---+----+
| P1| D1|drug|
+---+---+----+
| p2|  a|null|
| p2|  b|null|
| p2|  d|null|
| p1|  a|   c|
| p1|  b|   c|
| p1|  c|   c|
+---+---+----+

So my question is: what is the best way to select columns after a join, and is this behavior normal?

Edit: as per user8371915, this is the same as the question tagged as
Spark SQL performing carthesian join instead of inner join

but my question involves two dataframes that share the same lineage; the join itself behaves as expected when show is invoked, yet selecting columns after the join behaves differently.

Bg1850
  • Since the result of `df.col` or `df['col']` is a `Column` type which is not bound to the dataframe, I believe the result is expected. I'm wondering why aren't you getting `ambiguous column name` error while selecting in the erroneous case. – philantrovert Jun 07 '18 at 08:18
  • In general join between `DataFrames` sharing the same lineage can result in trivially true / false predicates. This case should be handled automatically, but it looks like things slipped through the cracks here. Honest advice - always use aliases. – Alper t. Turker Jun 07 '18 at 10:57
  • Possible duplicate of [Spark SQL performing carthesian join instead of inner join](https://stackoverflow.com/questions/32190828/spark-sql-performing-carthesian-join-instead-of-inner-join) – Alper t. Turker Jun 07 '18 at 10:57
  • @user8371915 does not look like the same question. – Bg1850 Jun 07 '18 at 12:53

2 Answers


The best way is to use aliases:

from pyspark.sql.functions import col

toy_df.alias("toy_df") \
    .join(mdf.alias("mdf"), ["patient"], "left") \
    .select(
        col("patient").alias("P1"),
        col("toy_df.drug").alias("D1"),
        col("patient").alias("patient"),
        col("mdf.drug").alias("drug")
    ) \
    .show()

The problem is that mdf is derived from toy_df, so both toy_df.drug and mdf.drug refer to the very same column. When you pass them to select, Spark therefore returns values from that single column for both.

Sergey Khudyakov

I was able to replicate your findings, and I wish I knew why this happens. However, I was able to get your desired results by just changing the alias of the second (right) dataset: I changed mdf.drug to mdf.drugs.

mdf = toy_df.filter(toy_df.drug == 'c').select(toy_df.patient,toy_df.drug.alias("drugs"))

So after the join:

toy_df.join(mdf,["patient"],"left").select(toy_df.patient.alias("P1"),toy_df.drug.alias('D1'),mdf.patient,mdf.drugs).show()

I got the expected behavior:

+---+---+-------+-----+
| P1| D1|patient|drugs|
+---+---+-------+-----+
| p2|  a|     p2| null|
| p2|  b|     p2| null|
| p2|  d|     p2| null|
| p1|  a|     p1|    c|
| p1|  b|     p1|    c|
| p1|  c|     p1|    c|
+---+---+-------+-----+

I am going to do more research and see if I can expand on this initial answer.

vaiz84