When you use regex patterns in spark-sql with double quotes, i.e. spark.sql("....."), the pattern is a string embedded within another string, so it goes through two rounds of escape processing. Consider this
scala> val df = Seq("12.1").toDF("val")
df: org.apache.spark.sql.DataFrame = [val: string]
scala> df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne")).show
+-------+
|PartOne|
+-------+
| 12|
+-------+
scala> df.createOrReplaceTempView("temp")
With the DataFrame API (df), the pattern string is passed directly to split(), so one level of escaping is enough: the Scala literal "\\." is the two-character string \., which the regex engine reads as a literal dot.
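You can see this one level of escaping outside Spark; a minimal sketch using plain Scala and the JDK regex engine, no Spark session needed:

```scala
// Sketch: what the DataFrame API's split() receives.
// The Scala literal "\\." is the two-character string \. ,
// which the regex engine reads as a literal dot.
object DfEscapeDemo {
  def main(args: Array[String]): Unit = {
    val pattern = "\\."                     // one level of escaping
    println(pattern)                        // prints: \.
    println("12.1".split(pattern).toList)   // prints: List(12, 1)
  }
}
```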
But in spark-sql the pattern goes through an extra round: the SQL parser first unescapes the string literal and only then hands the result to the split() function. So for split() to receive \., the SQL-level string must be \\. — and the way to produce that from a Scala double-quoted literal is to add 2 more \
scala> "\\."
res12: String = \.
scala> "\\\\."
res13: String = \\.
scala>
If you just pass "\\."
in the spark-sql, Scala first converts it into \.
and the SQL parser then converts that into ".", which in regex context means "any" character,
i.e. split on every character; since adjacent matches have nothing between them, you get an array of empty strings.
The string "12.1" has length four, and the split also produces a trailing empty string after the last character, so up to split(val, '\.')[4] you'll get the
empty string. When you issue split(val, '\.')[5], you'll get null.
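You can reproduce this empty-string behaviour without Spark; a minimal sketch with java.util.regex.Pattern, using limit -1 so trailing empty strings are kept (matching what Spark's SQL split does):

```scala
import java.util.regex.Pattern

// Sketch: splitting "12.1" on the regex "." (any character).
// Every character is a match, so each piece between adjacent
// matches is an empty string; limit -1 keeps the trailing one.
object AnyCharSplitDemo {
  def main(args: Array[String]): Unit = {
    val parts = Pattern.compile(".").split("12.1", -1)
    println(parts.length)             // prints: 5 (indices 0 to 4)
    println(parts.forall(_.isEmpty))  // prints: true
  }
}
```

Indices 0 through 4 hold empty strings; index 5 is past the end of the array, which spark-sql surfaces as null.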
To verify this, you can pass the same delimiter string "\\."
to the regexp_replace() function and see what happens
scala> spark.sql("select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| | 9999|
+------------+------+
scala> spark.sql("select split(val, '\\\\.')[0] as firstPartSQL, regexp_replace(val,'\\\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala>
If you still want to use the same pattern text between df and sql, then go with a raw string, i.e. triple quotes or the raw interpolator.
scala> raw"\\."
res23: String = \\.
scala>
scala> spark.sql("""select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala> spark.sql("""select split(val, "\\.")[0] as firstPartSQL, regexp_replace(val,"\\.",'9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala>