
I have a code, for example C78907, and I want to split it into levels:

C78     # Level 1
C789    # Level 2
C7890   # Level 3
C78907  # Level 4

So far this is what I'm using:

from pyspark.sql.functions import concat

Df3 = Df2.withColumn('Level_One', concat(Df2.code.substr(1, 3)))
Df4 = Df3.withColumn('Level_Two', concat(Df3.code.substr(1, 4)))
Df5 = Df4.withColumn('Level_Three', concat(Df4.code.substr(1, 5)))
Df6 = Df5.withColumn('Level_Four', concat(Df5.code.substr(1, 6)))

The issue is that, looking at the results, the level-four codes (supposed to have 6 components) may actually contain codes from level one, two, or three.

Level_One  Level_Two  Level_Three  Level_Four
721        7213       7213         7213
758        7580       7580         7580
724        7242       7242         7242
737        7373       73730        73730
789        7895       78959        78959
V06        V061       V061         V061
381        3810       38100        38100

Ideally, a restriction would help. I mean:

  • For level one, keep only 3 components.
  • For level two, 4 components and NOT LESS.
  • For level three, 5 components and NOT LESS.
  • For level four, 6 components and NOT LESS.
  • If the required number of components doesn't exist, put null instead of imputing with the previous level.

The desired output:

Initial_code  level1  level2  level3  level4
7213          721     7213    null    null
7580          758     7580    null    null
7242          724     7242    null    null
73730         737     7373    73730   null
38100D        381     3810    38100   38100D
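
For reference, a minimal DataFrame matching the sample above can be built like this (a sketch; the column name code is assumed from the snippet earlier in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows taken from the desired-output table above.
df = spark.createDataFrame(
    [('7213',), ('7580',), ('7242',), ('73730',), ('38100D',)],
    ['code']
)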
    What is your desired output? Can you edit your question and add it? It would also be helpful if you could provide a sample dataframe: [how to make good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Mar 22 '18 at 15:48

1 Answer


You can achieve your desired output by using pyspark.sql.functions.when() together with pyspark.sql.functions.length(). When creating each column, check whether the substring has the required length; if it does not, set the column to null using pyspark.sql.functions.lit().

For example:

import pyspark.sql.functions as f

# Keep each prefix only when the code is long enough to supply it;
# otherwise the column becomes null.
df.withColumn('Level_One', f.when(
        f.length(f.col('code').substr(1, 3)) == 3,
        f.col('code').substr(1, 3)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Two', f.when(
        f.length(f.col('code').substr(1, 4)) == 4,
        f.col('code').substr(1, 4)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Three', f.when(
        f.length(f.col('code').substr(1, 5)) == 5,
        f.col('code').substr(1, 5)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Four', f.when(
        f.length(f.col('code').substr(1, 6)) == 6,
        f.col('code').substr(1, 6)
    ).otherwise(f.lit(None)))\
    .show()

Output:

+------+---------+---------+-----------+----------+
|  Code|Level_One|Level_Two|Level_Three|Level_Four|
+------+---------+---------+-----------+----------+
|  7213|      721|     7213|       null|      null|
|  7580|      758|     7580|       null|      null|
|  7242|      724|     7242|       null|      null|
| 73730|      737|     7373|      73730|      null|
|38100D|      381|     3810|      38100|    38100D|
+------+---------+---------+-----------+----------+
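
Since the four withColumn calls differ only in the column name and prefix length, the same logic can also be written as a loop (a sketch, assuming the same df with a code column as above):

import pyspark.sql.functions as f

levels = {'Level_One': 3, 'Level_Two': 4, 'Level_Three': 5, 'Level_Four': 6}
for name, n in levels.items():
    # Keep the n-character prefix only when the code has at least n characters.
    df = df.withColumn(
        name,
        f.when(f.length(f.col('code')) >= n, f.col('code').substr(1, n))
         .otherwise(f.lit(None))
    )
df.show()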
    **Clarification**: the `otherwise(f.lit(None))` is not needed here as `when` will return `null` by default if the condition is not met. – pault May 23 '19 at 13:47
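
In other words, each column above could be written without the otherwise, for example (a sketch for Level_One only):

df.withColumn('Level_One', f.when(
    f.length(f.col('code').substr(1, 3)) == 3,
    f.col('code').substr(1, 3)
))  # rows that fail the condition become null automatically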