
I'm trying to split a String in a DataFrame column using Spark SQL and Scala, and there seems to be a difference in the way the split condition works in the two.

Using Scala, this works:

val seq = Seq("12.1")
val df = seq.toDF("val")

val afterSplit = df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne"))
afterSplit.show(false)

However, in Spark SQL, when I use the same pattern, firstPartSQL comes back blank.

df.registerTempTable("temp")
val s1 = sqlContext.sql("select split(val, '\\.')[0] as firstPartSQL from temp")

Instead, when I use this (the separator represented as [.] instead of \.), the expected value shows up.

val s1 = sqlContext.sql("select split(val, '[.]')[0] as firstPartSQL from temp")

Any ideas why this is happening?

Karan Alang
  • That's because SQL requires additional escaping, i.e. `select split(val, '\\\\.')[0] as firstPartSQL from temp` – zero323 Jan 08 '19 at 00:39
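For reference, a minimal check of the comment's suggestion, assuming the same sqlContext and temp table registered in the question; the output shown is what the escaping rules explained in the answer below predict:

val s2 = sqlContext.sql("select split(val, '\\\\.')[0] as firstPartSQL from temp")
s2.show
+------------+
|firstPartSQL|
+------------+
|          12|
+------------+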

1 Answer


When you use regex patterns in spark-sql inside double quotes, i.e. spark.sql("..."), the pattern is a string within another string, so two rounds of unescaping happen. Consider this:

scala> val df = Seq("12.1").toDF("val")
df: org.apache.spark.sql.DataFrame = [val: string]

scala> df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne")).show
+-------+
|PartOne|
+-------+
|     12|
+-------+


scala> df.createOrReplaceTempView("temp")

With the DataFrame API, the regex string is passed directly to split(), so you only need to escape the backslash once (\\.).

But with spark-sql, the pattern is first unescaped by Scala as a string and then unescaped again by the SQL parser before it reaches the split() function. So the SQL statement itself needs to contain \\. for the regex to end up as \.

The way to get that from a double-quoted Scala string is to add two more backslashes:

scala> "\\."
res12: String = \.

scala> "\\\\."
res13: String = \\.

scala>

If you just pass "\\." in spark-sql, it is first unescaped to \. and then to ".", which in a regex context means "any character", i.e. split on every character. Since every character is adjacent to the next, you get an array of empty strings. The string "12.1" has length four, and the pattern also matches at the final boundary of the string, so up to split(val, '\.')[4] you get an empty string; split(val, '\.')[5] gives null.
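A quick sketch to check that indexing claim against the same temp view (the output shown is what the rule above predicts, not copied from the original post):

scala> spark.sql("select split(val, '\\.')[0] as idx0, split(val, '\\.')[4] as idx4, split(val, '\\.')[5] as idx5 from temp").show
+----+----+----+
|idx0|idx4|idx5|
+----+----+----+
|    |    |null|
+----+----+----+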

To verify this, you can pass the same delimiter string "\\." to the regexp_replace() function and see what happens:

scala> spark.sql("select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
|            |  9999|
+------------+------+

scala> spark.sql("select split(val, '\\\\.')[0] as firstPartSQL, regexp_replace(val,'\\\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
|          12|  1291|
+------------+------+


scala>

If you still want to use the same pattern in both the DataFrame API and spark-sql, go with a raw string, i.e. triple quotes, which don't process backslash escapes:

scala> raw"\\."
res23: String = \\.

scala>
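You could also build the query with the raw interpolator instead of triple quotes; a small sketch against the same temp view, equivalent in effect to the triple-quoted versions below:

scala> spark.sql(raw"select split(val, '\\.')[0] as firstPartSQL from temp").show
+------------+
|firstPartSQL|
+------------+
|          12|
+------------+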

scala> spark.sql("""select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
|          12|  1291|
+------------+------+


scala> spark.sql("""select split(val, "\\.")[0] as firstPartSQL, regexp_replace(val,"\\.",'9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
|          12|  1291|
+------------+------+


scala>
stack0114106