I have a DataFrame 'regexDf' like the one below:
id,regex
1,(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
If the length of a regex exceeds some maximum length, for example 50, I want to remove the last text token from each of the '|'-separated alternatives of the regex for that id. In the DataFrame above, the regex for id 1 is longer than 50 characters, so the last tokens 'text4(.*)' and 'text6(.*)' should be removed from its two alternatives. Even after that removal the regex for id 1 is still longer than 50, so the last tokens 'text3(.*)' and 'text5(.*)' should be removed as well. The final DataFrame will then be:
id,regex
1,(.*)text1(.*)text2(.*)|(.*)text2(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
I am able to trim the last tokens using the following code:
// split takes a regex, so the pipe has to be escaped
val reducedStr = regex.split("\\|").foldLeft(List[String]()) {
  (regexStr, eachRegex) => {
    regexStr :+ eachRegex.replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
  }
}.mkString("|")
I tried using a while loop to check the length and trim the text tokens on each iteration, but it is not working. I also want to avoid using var and a while loop. Is it possible to achieve this without a while loop?
val optimizeRegexString = udf((regex: String) => {
  if (regex.length >= 50) {
    var len = regex.length
    var resultStr: String = ""
    while (len >= maxLength) {
      val reducedStr = regex.split("\\|").foldLeft(List[String]()) {
        (regexStr, eachRegex) => {
          regexStr :+ eachRegex
            .replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
        }
      }.mkString("|")
      len = reducedStr.length
      resultStr = reducedStr
    }
    resultStr
  } else {
    regex
  }
})
regexDf.withColumn("optimizedRegex", optimizeRegexString(col("regex")))
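For reference, the behaviour I am after can probably be written without var or while as a tail-recursive helper. This is only a minimal sketch: the names ShrinkRegex, dropLastTokens, shrink and MaxLength are made up for illustration, the limit of 50 is the example value from above, and the Spark part is left out so the snippet runs on its own.

```scala
import scala.annotation.tailrec

object ShrinkRegex {
  // Assumed limit, taken from the example value in the question.
  val MaxLength = 50

  // Matches the trailing "(.*)word(.*)" of a single alternative.
  private val LastToken = "\\(\\.\\*\\)\\w+\\(\\.\\*\\)$"

  // Removes the last text token from every '|'-separated alternative.
  def dropLastTokens(regex: String): String =
    regex.split("\\|").map(_.replaceAll(LastToken, "(.*)")).mkString("|")

  // Trims repeatedly until the string fits, or until trimming
  // no longer changes anything (which avoids endless recursion).
  @tailrec
  def shrink(regex: String, maxLength: Int = MaxLength): String =
    if (regex.length <= maxLength) regex
    else {
      val reduced = dropLastTokens(regex)
      if (reduced == regex) regex else shrink(reduced, maxLength)
    }
}
```

Inside Spark this could presumably be wrapped as udf((regex: String) => ShrinkRegex.shrink(regex)) and applied with withColumn as in the snippet above.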
Following the suggestions from SathiyanS and Pasha, I changed the recursive method into a function value.
def optimizeRegex(regexDf: DataFrame): DataFrame = {
  val shrinkString= (s: String) => {
    if (s.length > 50) {
      val extractedString: String = shrinkString(s.split("\\|")
        .map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
      extractedString
    }
    else s
  }
  def shrinkUdf = udf((regex: String) => shrinkString(regex))
  regexDf.withColumn("regexString", shrinkUdf(col("regex")))
}
Now I am getting the compile error "recursive value shrinkString needs type":
Error:(145, 39) recursive value shrinkString needs type
val extractedString: String = shrinkString(s.split("\\|")
.map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"));
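As far as I understand it, the compiler cannot infer the type of a val whose right-hand side refers to the val itself, whereas a method with a declared result type is allowed to call itself. A minimal sketch of that variant follows. The object name ShrinkWithDef and the constant MaxLength are hypothetical, the guards for a missing "text" token and for an unchanged string are my own additions, and the Spark wrapper is omitted so the code stands alone.

```scala
object ShrinkWithDef {
  // Assumed limit, matching the 50 used in the question.
  val MaxLength = 50

  // A method may call itself as long as its result type is declared.
  def shrinkString(s: String): String =
    if (s.length > MaxLength) {
      val shortened = s.split("\\|").map { part =>
        val idx = part.lastIndexOf("text")
        // Leave the alternative untouched when no "text" token is left;
        // substring(0, -1) would otherwise throw.
        if (idx >= 0) part.substring(0, idx) else part
      }.mkString("|")
      // Stop when nothing changed, otherwise the recursion would not end.
      if (shortened == s) s else shrinkString(shortened)
    } else s
}
```

The UDF would then presumably become udf((regex: String) => ShrinkWithDef.shrinkString(regex)).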