
I am trying to check for a condition in a PySpark DataFrame and add values to a column, like below:

DF:

cd    id    Location
A     A     A
A     AA    A
A     AAA   
A     B     B
A     BB    B
A     BBB   

Expected output:

cd    id    Location
A     A     A
A     AA    A
A     AAA   New_Loc
A     B     B
A     BB    B
A     BBB   New_Loc   

I tried to populate it using the below PySpark transformation:

df_new = df.withColumn('Location',sf.when(df.cd == 'A' & (df.id.isin(['AAA','BBB'])),'New_Loc').otherwise(df.Location))

When I try to execute this, I am getting the error: Py4JError: An error occurred while calling o129.and. Trace: py4j.Py4JException: Method and([class java.lang.String]) does not exist

Any idea what this error is?

Padfoot123

2 Answers


It's most likely the syntax. This should work:

import pyspark.sql.functions as f

df_new = df.withColumn(
  'Location', 
  f.when(
    (f.col('cd') == 'A') & (f.col('id').isin(['AAA','BBB'])),
    f.lit('New_Loc'))
  .otherwise(f.col('Location'))
)
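The parentheses matter because `&` binds more tightly than `==` in Python. A quick, Spark-free sketch using the stdlib `ast` module (`cd` and `ids` are placeholder names, not real columns) shows how each form parses:

```python
import ast

def top_node(expr):
    """Name of the outermost node in a parsed expression."""
    return type(ast.parse(expr, mode="eval").body).__name__

# Without parentheses, `&` grabs the string first, so the whole
# expression parses as  cd == ('A' & ids)
print(top_node("cd == 'A' & ids"))    # Compare  (== is outermost)

# With parentheses, `&` combines the two conditions, as intended
print(top_node("(cd == 'A') & ids"))  # BinOp    (& is outermost)
```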
Napoleon Borntoparty
  • Adding a parenthesis around the first condition worked, although lit was not required. Thank you for the response – Padfoot123 May 15 '20 at 10:56
  • Yup, `pyspark` typically struggles with operators such as `==` and follow-up logical operators such as `&`. That being said, I recommend using `f.lit()` and `f.col()`, as some `pyspark` methods and functions are overloaded to accept both column names and literal values, so being specific is typically more beneficial and explicit. Also, as the Zen of Python tells us, explicit is better than implicit. – Napoleon Borntoparty May 15 '20 at 11:01

OK, adding a parenthesis around the conditions worked.

Below is what worked for me.

df_new = df.withColumn('Location',sf.when((df.cd == 'A') & (df.id.isin(['AAA','BBB'])),'New_Loc').otherwise(df.Location))
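For anyone curious why the original version raised `Method and([class java.lang.String]) does not exist`: since `&` outranks `==`, Python first evaluates `'A' & df.id.isin(...)`, and because `str` has no `&` operator, it falls back to the Column's reflected `__rand__`, which hands the raw string to the JVM-side `Column.and`. A toy stand-in class (not real PySpark) makes that dispatch visible:

```python
class FakeColumn:
    """Stand-in for pyspark.sql.Column, just to show operator dispatch."""
    def __rand__(self, other):
        # PySpark would forward `other` to the Java method Column.and()
        # here; Java has no and(String) overload, hence the Py4J error.
        return f"Column.and({other!r})"

# str.__and__ does not exist, so Python tries FakeColumn.__rand__('A')
print('A' & FakeColumn())  # Column.and('A')
```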
Padfoot123
  • You can't write a string value directly into a column without using lit. – Shubham Jain May 15 '20 at 10:50
  • Please try the above and let me know if it doesn't work. Here is another example: https://stackoverflow.com/questions/54839033/add-column-to-pyspark-dataframe-based-on-a-condition – Padfoot123 May 15 '20 at 10:54