My PySpark version is 2.1.1. I am trying to do a left outer join between two dataframes that each have two columns, id and priority. I create the dataframes like this:
a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)
b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)
The schema of c_df comes out as DataFrame[id: int, priority: int, id: int, priority: int].
The drop call has not removed b_df's priority column.
But if I try to do:
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)
Then the priority column from a_df gets dropped.
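(As a side note, joining on the column name string instead of an expression should, as far as I understand, merge the two id columns into one, but it still leaves two priority columns, so it does not help with the drop question. A sketch:)
c_df = a_df.join(b_df, 'id', 'left')
# schema: DataFrame[id: int, priority: int, priority: int]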
I am not sure if this is a version issue or something else, but it feels very strange that drop would behave like this.
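For what it's worth, I would expect explicitly aliasing the dataframes and then selecting by qualified name to sidestep the ambiguity entirely. This is only a sketch (the alias names a and b are my own):
from pyspark.sql.functions import col
a2 = a_df.alias('a')
b2 = b_df.alias('b')
# keep only a's columns, so no duplicate priority survives the join
c_df = a2.join(b2, col('a.id') == col('b.id'), 'left').select('a.id', 'a.priority')
# schema: DataFrame[id: int, priority: int]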
I know a workaround is to remove the unwanted column first and then do the join (sketched below), but I still do not understand why drop is not working here.
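Concretely, the workaround I mean looks like this; drop works fine here because the column reference is unambiguous before the join:
b_slim = b_df.drop('priority')  # b_slim is just a name I picked
c_df = a_df.join(b_slim, a_df.id == b_slim.id, 'left')
# schema: DataFrame[id: int, priority: int, id: int]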
Thanks in advance.