
I'm using PySpark 2.4.

I have a weird issue with dropping columns after joining.

I get the correct result if I drop one column, but I get an error if I drop two columns.

I want to drop the 'role' and 'address' columns that come from the workers1 DataFrame from the joined DataFrame (joined_workers).

from pyspark.sql import functions as f

workers1 = spark.createDataFrame(
    [("barmen", "Paris", "25"),
    ("waitress", None, "22")],
    ["role", "address", "age"])

workers1.toPandas()

>>>
    role      address   age
0   barmen    Paris     25
1   waitress  None      22

workers2 = spark.createDataFrame(
    [("barmen", "Paris"),
    (None, "Berlin")],
    ["role", "address"])

workers2.toPandas()

>>>
    role    address
0   barmen  Paris
1   None    Berlin

columns_to_join_on = ["role", "address"]

joined_workers = workers1.alias("workers1").join(
    workers2.alias("workers2"),
    [
        *[getattr(workers1, col).eqNullSafe(
                getattr(workers2, col)) for col in columns_to_join_on]
    ],
    how="right",
)

joined_workers.toPandas()

>>>
    role    address age     role    address
0   None    None    None    None    Berlin
1   barmen  Paris   25      barmen  Paris

# expected result

joined_workers.drop(*[f.col("workers1.role")]).toPandas()

>>>
    address age     role    address
0   None    None    None    Berlin
1   Paris   25      barmen  Paris

# Works as expected

joined_workers.drop(*[f.col("workers1.address")]).toPandas()

>>>
    role    age     role     address
0   None    None    None     Berlin
1   barmen  25      barmen   Paris

# Works as expected

joined_workers.drop(*[f.col("workers1.role"), f.col("workers1.address")]).toPandas()

>>>
TypeError: each col in the param list should be a string
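
For reference: in PySpark 2.4, DataFrame.drop only accepts multiple columns when they are passed as name strings; a single Column object is allowed, but several are not, which is exactly what this TypeError reports. A minimal workaround sketch, assuming the workers1 alias qualifiers still resolve after the first drop, is to chain single-column drops:

# Sketch: chain one-Column drops instead of passing several Column objects,
# since drop() only takes multiple columns as plain name strings.
(joined_workers
    .drop(f.col("workers1.role"))
    .drop(f.col("workers1.address"))
    .toPandas())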

1 Answer


Just select the columns you want to retain, or select all columns except the ones you want to drop.

df.select([col for col in df.columns if col not in ['workers1.role', 'workers1.address']])

Update: In case of join with common column names:

joined_workers.select(["workers1."+col for col in workers1.columns if col not in ['role', 'address']]+["workers2."+col for col in workers2.columns if col not in [<if_any_from_2>]]).show()

Remove the second if condition (the one on workers2.columns) if all columns from the workers2 DataFrame are to be retained.
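
Applied to the question's DataFrames, a minimal sketch of that variant (keeping only age from workers1 and every column from workers2) would be:

# Sketch: keep only "age" from workers1 and all columns from workers2.
joined_workers.select(
    ["workers1." + col for col in workers1.columns if col not in ["role", "address"]]
    + ["workers2." + col for col in workers2.columns]
).toPandas()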

  • "Py4JJavaError: An error occurred while calling o441.select. : org.apache.spark.sql.AnalysisException: Reference 'role' is ambiguous, could be: workers1.role, workers2.role.;" – amit Jan 03 '23 at 09:06
  • @amit You used joined_workers.columns right in the loop? – Ronak Jain Jan 03 '23 at 09:10
  • @amit Added join solution – Ronak Jain Jan 03 '23 at 09:20
  • joined_workers.select([col for col in joined_workers.columns if col not in ['workers1.role', 'workers1.address']]) >>> Py4JJavaError: An error occurred while calling o441.select. : org.apache.spark.sql.AnalysisException: Reference 'role' is ambiguous, could be: workers1.role, workers2.role. – amit Jan 03 '23 at 10:07
  • @amit use the second solution – Ronak Jain Jan 03 '23 at 10:11
  • 1
    @amit Use joined_workers.select(["workers1."+col for col in workers1.columns if col not in ['role', 'address']]+["workers2."+col for col in workers2.columns]).show() – Ronak Jain Jan 03 '23 at 10:20