I have the following DataFrame -
from pyspark.sql.functions import col, when, length, lit, concat
values = [(1,'USA','12424','AB+'),(2,'Japan','63252','B-'),(3,'Ireland','23655',None),(4,'France','57366','O+'),
(5,'Ireland','82351','A-'),(6,'USA','35854','B+'),(7,'Ireland','5835','AB-'),(8,'USA','95255','B+')]
df = sqlContext.createDataFrame(values,['id','country','postcode','bloodgroup'])
df.show()
+---+-------+--------+----------+
| id|country|postcode|bloodgroup|
+---+-------+--------+----------+
| 1| USA| 12424| AB+|
| 2| Japan| 63252| B-|
| 3|Ireland| 23655| null|
| 4| France| 57366| O+|
| 5|Ireland| 82351| A-|
| 6| USA| 35854| B+|
| 7|Ireland| 5835| AB-|
| 8| USA| 95255| B+|
+---+-------+--------+----------+
I need to modify the postcode and bloodgroup columns according to the following conditions, summarised in this rough Python pseudocode -
# Efficient (pseudocode 1)
if country == 'Ireland':
    if length(postcode) == 4:
        postcode = '0' + postcode  # Prepend a 0 in case the postcode is only 4 digits.
    if bloodgroup == null:
        bloodgroup = 'Unknown'
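For concreteness, here is the same logic as runnable plain Python on a single row held as a dict (illustrative only - fix_row and the dict layout are mine, not part of the actual pipeline):

def fix_row(row):
    if row['country'] == 'Ireland':  # common clause checked once
        if len(row['postcode']) == 4:
            row['postcode'] = '0' + row['postcode']
        if row['bloodgroup'] is None:
            row['bloodgroup'] = 'Unknown'
    return row

fix_row({'country': 'Ireland', 'postcode': '5835', 'bloodgroup': None})
# -> {'country': 'Ireland', 'postcode': '05835', 'bloodgroup': 'Unknown'}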
As you can see in the pseudocode above, the check country == 'Ireland' is done only once, since it is a clause common to both conditions. Coupling this clause with each of the other two conditions using and would be inefficient -
# Inefficient (pseudocode 2)
if country == 'Ireland' and length(postcode) == 4:
    postcode = '0' + postcode
if country == 'Ireland' and bloodgroup == null:
    bloodgroup = 'Unknown'
I am using PySpark, and the only way I know how to do this is as follows -
# Prepend '0' to 4-character Irish postcodes.
df = df.withColumn('postcode',
                   when((col('country') == 'Ireland') & (length(col('postcode')) == 4),
                        concat(lit('0'), col('postcode'))).otherwise(col('postcode')))

# Replace null bloodgroups with 'Unknown' for Ireland.
df = df.withColumn('bloodgroup',
                   when((col('country') == 'Ireland') & col('bloodgroup').isNull(),
                        'Unknown').otherwise(col('bloodgroup')))
df.show()
+---+-------+--------+----------+
| id|country|postcode|bloodgroup|
+---+-------+--------+----------+
| 1| USA| 12424| AB+|
| 2| Japan| 63252| B-|
| 3|Ireland| 23655| Unknown|
| 4| France| 57366| O+|
| 5|Ireland| 82351| A-|
| 6| USA| 35854| B+|
| 7|Ireland| 05835| AB-|
| 8| USA| 95255| B+|
+---+-------+--------+----------+
But this corresponds to the inefficient pseudocode 2 above, because we are checking country == 'Ireland' twice. I have checked the execution plan using df.explain(), and it does not do any automatic optimization here, which I thought Catalyst might.
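This is what I ran to inspect the plan (I have omitted the output since it varies by Spark version; in my run the country comparison appeared inside each CASE WHEN expression of the projection):

# Print the parsed, analyzed, optimized and physical plans.
df.explain(True)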
How can I write PySpark code corresponding to pseudocode 1, where the check on country is done once and the two conditions are then tested inside it?
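One direction I have considered is computing both columns inside a single expression, so that the country predicate appears only once - for example, packing the two results into a temporary struct and unpacking it afterwards. A rough sketch of that idea (the fixed column name is just a placeholder I made up):

from pyspark.sql.functions import struct

is_ireland = col('country') == 'Ireland'  # appears once in the expression tree

fixed = when(is_ireland,
             struct(when(length(col('postcode')) == 4,
                         concat(lit('0'), col('postcode')))
                    .otherwise(col('postcode')).alias('postcode'),
                    when(col('bloodgroup').isNull(), lit('Unknown'))
                    .otherwise(col('bloodgroup')).alias('bloodgroup'))
            ).otherwise(struct(col('postcode'), col('bloodgroup')))

df = (df.withColumn('fixed', fixed)
        .withColumn('postcode', col('fixed.postcode'))
        .withColumn('bloodgroup', col('fixed.bloodgroup'))
        .drop('fixed'))

But I am not sure whether this actually evaluates the predicate once after codegen, or whether there is a cleaner, more idiomatic way to express it.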