I have a code for example C78907. I want to split it:
C78 # level 1
C789 # Level2
C7890 # Level 3
C78907 # Level 4
So far what I m using:
Df3 = Df2.withColumn('Level_One', concat(Df2.code.substr(1, 3)))
Df4 = Df3.withColumn('Level_two', concat(Df3.code.substr(1, 4)))
Df5 = Df4.withColumn('Level_theree', concat(Df4.code.substr(1, 5)))
Df6 = Df5.withColumn('Level_four', concat(Df5.code.substr(1, 6)))
The issue is when looking to the results, the codes of level four ( supposed to be 6 components) might contain codes from level one or two or three.
721 7213 7213 7213
758 7580 7580 7580
724 7242 7242 7242
737 7373 73730 73730
789 7895 78959 78959
V06 V061 V061 V061
381 3810 38100 38100
Ideally a restriction could be useful. I mean:
- For level one keep only 3 components.
- For level two 4 components and NOT LESS.
- For level three 5 components and NOT LESS.
- For level four 6 components and NOT LESS.
- If the required number of components don't exist then put
null
instead of imputing with the previous one.
The desired output :
Initial_code level1 level2 level3 level4
7213 721 7213 null null
7580 758 7580 null null
7242 724 7242 null null
73730 737 7373 73730 null
38100D 381 3810 38100 38100D