
I have the DataFrame below -

from pyspark.sql.functions import col, when, length, lit, concat
values = [(1,'USA','12424','AB+'),(2,'Japan','63252','B-'),(3,'Ireland','23655',None),(4,'France','57366','O+'),
          (5,'Ireland','82351','A-'),(6,'USA','35854','B+'),(7,'Ireland','5835','AB-'),(8,'USA','95255','B+')]
df = spark.createDataFrame(values, ['id','country','postcode','bloodgroup'])
df.show()
+---+-------+--------+----------+
| id|country|postcode|bloodgroup|
+---+-------+--------+----------+
|  1|    USA|   12424|       AB+|
|  2|  Japan|   63252|        B-|
|  3|Ireland|   23655|      null|
|  4| France|   57366|        O+|
|  5|Ireland|   82351|        A-|
|  6|    USA|   35854|        B+|
|  7|Ireland|    5835|       AB-|
|  8|    USA|   95255|        B+|
+---+-------+--------+----------+

I need to change the postcode and bloodgroup columns according to the following conditions, summarised in this rough Python pseudocode -

# Efficient (pseudocode 1)
if country == 'Ireland':
    if length(postcode) == 4:
        postcode = '0' + postcode   # Prepend '0' in case the postcode is only 4 digits.
    if bloodgroup == null:
        bloodgroup = 'Unknown'

As you can see in the pseudocode above, the check country == 'Ireland' is done only once, since it is a clause common to both conditions. Doing it the other way, coupling this clause with each of the other two conditions using and, would be inefficient -

# Inefficient (pseudocode 2)
if country == 'Ireland' and length(postcode) == 4:
    postcode = '0' + postcode
if country == 'Ireland' and bloodgroup == null:
    bloodgroup = 'Unknown'

I am using PySpark and the only way I know how to do this is as follows -

df = df.withColumn(
    'postcode',
    when((col('country') == 'Ireland') & (length(col('postcode')) == 4),
         concat(lit('0'), col('postcode')))
    .otherwise(col('postcode'))
)
df = df.withColumn(
    'bloodgroup',
    when((col('country') == 'Ireland') & col('bloodgroup').isNull(),
         lit('Unknown'))
    .otherwise(col('bloodgroup'))
)
df.show()
+---+-------+--------+----------+
| id|country|postcode|bloodgroup|
+---+-------+--------+----------+
|  1|    USA|   12424|       AB+|
|  2|  Japan|   63252|        B-|
|  3|Ireland|   23655|   Unknown|
|  4| France|   57366|        O+|
|  5|Ireland|   82351|        A-|
|  6|    USA|   35854|        B+|
|  7|Ireland|   05835|       AB-|
|  8|    USA|   95255|        B+|
+---+-------+--------+----------+

But this corresponds to the inefficient pseudocode above, because we check country == 'Ireland' twice. I have inspected the execution plan using df.explain(), and it does not do this optimization automatically, which I thought Catalyst might do.
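
For reference, the extended plan output can be printed as below; in the physical plan each of the two CASE WHEN expressions carries its own country = 'Ireland' comparison, which is where the duplicated check shows up -

df.explain(True)   # prints the parsed, analysed, optimised and physical plans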

How can we write PySpark code corresponding to pseudocode 1, where we do the country check once and then test the two conditions?

  • I can think of a way to do this similar to the "efficient" approach you showed, but it would return a `StructType` column (sketched below). Otherwise, chained `withColumn` statements are probably the most efficient you're going to get since there's no easy way to [get multiple columns in a single call to withColumn](https://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe). – pault Dec 11 '18 at 16:34
  • Many thanks Pault for your comments. Always appreciated. Thanks a lot for referring me to this link. I will investigate it further. – cph_sto Dec 11 '18 at 19:41
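
A minimal sketch of the StructType idea pault mentions above, applied to the original df (the names is_irish and _fixed are illustrative, not from the post): the country test is written once, both derived values are built inside a single struct, and the struct is then unpacked back into the two columns -

from pyspark.sql.functions import col, when, length, lit, concat, struct

is_irish = col('country') == 'Ireland'   # the common clause, written once

df = (
    df.withColumn(
        '_fixed',   # temporary struct column; the name is illustrative
        when(
            is_irish,
            struct(
                when(length(col('postcode')) == 4,
                     concat(lit('0'), col('postcode')))
                .otherwise(col('postcode')).alias('postcode'),
                when(col('bloodgroup').isNull(), lit('Unknown'))
                .otherwise(col('bloodgroup')).alias('bloodgroup'),
            ),
        ).otherwise(struct(col('postcode'), col('bloodgroup'))),
    )
    .withColumn('postcode', col('_fixed.postcode'))
    .withColumn('bloodgroup', col('_fixed.bloodgroup'))
    .drop('_fixed')
)

Whether this actually avoids the second comparison is ultimately up to Catalyst, and a string equality test is cheap, so any gain over the chained withColumn version may well be negligible.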
