I am splitting a Spark DataFrame column (a String column) into several columns with the following:

val dfSplitted = df.withColumn("splc", split(col("c_001"), "\\[|\\[.*?\\]\\|(^;\\])")).select(col("*") +: (0 until 12).map(i => col("splc").getItem(i).as(s"spl_c$i")): _*).drop("splc","c_001")

Here the String column (c_001) has a structure similar to the following:

Str|[ts1:tssub2|ts1:tssub2]|BLANK|[INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|[INT3|s1]]|[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]|BLANK |BLANK |[s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]|[[s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]|BLANK |BLANK |[s14|s15] 

I want the split columns (spl_c0 to spl_c11) to look like this (represented as rows):

Str
[ts1:tssub2|ts1:tssub2]
BLANK
[INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|[INT3|s1]]
[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]
BLANK
BLANK
[s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]
[[s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]
BLANK
BLANK
[s14|s15]

Here spl_c7, [s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19], can contain one or more repetitions of a group like [s2|s3|s4|INT16|INT17] with different values. In this case there are two repetitions, with a semicolon as the separator.

Actual output columns (represented as rows):

Str|
ts1:tssub2|ts1:tssub2]|BLANK|
INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|
INT3|s1]]|
INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]|BLANK |BLANK |
s2|s3|s4|INT16|INT17];
s5|s6|s7|INT18|INT19]|

s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];

s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]|BLANK |BLANK |
s14|s15]

Why doesn't my split give the desired result? Also, is there another approach to this (especially with performance in mind)?

  • You can split using custom `udf` function. https://stackoverflow.com/questions/40931278/how-to-use-dataframe-explode-with-a-custom-udf-to-split-a-string-into-substrings – Rumesh Krishnan Apr 09 '18 at 02:01

1 Answer

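The issue is with the regex itself. In the alternation, the first branch `\\[` matches every opening bracket on its own, so the other branches never contribute a match for this input and the string is simply cut at each `[`. Writing a single regular expression that covers all the criteria is a tough ask (at least for me). So instead of splitting the string with a regex, I found that splitting with a custom function is the better approach. I hit a few hiccups along the way, but with the help of the Stack Overflow community I was able to figure it out.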
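
The function-based approach could look roughly like the sketch below. This is not necessarily the exact code I ended up with; the names `BracketSplit` and `splitTopLevel` are made up for illustration. The idea is to walk the string once, track how deeply nested the current position is inside `[...]`, and split on `|` only when the depth is zero. It assumes the brackets in the input are balanced.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper (name and signature are assumptions, not from the question).
object BracketSplit {

  /** Splits `s` on `sep`, ignoring any separator nested inside [...]. */
  def splitTopLevel(s: String, sep: Char = '|'): List[String] = {
    val parts   = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var depth   = 0 // number of '[' currently open

    for (ch <- s) {
      if (ch == '[') {
        depth += 1
        current.append(ch)
      } else if (ch == ']') {
        // Clamp at zero so a stray ']' does not produce a negative depth.
        depth = math.max(0, depth - 1)
        current.append(ch)
      } else if (ch == sep && depth == 0) {
        // Top-level separator: close the current field and start a new one.
        parts += current.toString
        current.clear()
      } else {
        current.append(ch)
      }
    }

    parts += current.toString // last field (may be empty)
    parts.toList
  }
}
```

Because `;` is not treated as a separator, repeated groups such as `[s2|...];[s5|...]` stay together in one field. Fields keep their trailing spaces (for example `BLANK `), so trim them afterwards if needed. To use this on the DataFrame, the function can be wrapped with `udf` from `org.apache.spark.sql.functions` and applied to `c_001`, after which `getItem(i)` works as in the original code.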