2

I am trying to implement a query in my Scala code which uses a regexp on a Spark Column to find all the rows in the column which contain a certain value like:

 column.rlike(".*" + str + ".*")

str is a String that can be anything (except null or empty).

This works fine for the basic queries that I am testing. However being new to Spark / Scala, I am unsure of if there are any special cases that could break the code here that I need to take care of. Are there any characters that I need to be escaping or special cases that I need to worry about here?

zero323
  • 322,348
  • 103
  • 959
  • 935
rgamber
  • 5,749
  • 10
  • 55
  • 99

2 Answers2

5

This can be broken by any invalid regexp. You don't even have to try hard:

Seq("[", "foo", " ba.r ").toDF.filter($"value".rlike(".*" + "[ " + ".*")).show

or can give unexpected results if str is a non-trivial pattern itself. For simple cases like this you'll be better with Column.contains:

Seq("[", "foo", " ba.r ").toDF.filter($"value".contains("[")).show
Seq("[", "foo", " ba.r ").toDF.filter($"value".contains("a.r")).show
zero323
  • 322,348
  • 103
  • 959
  • 935
2

You can use rlike as zero suggested and Pattern.quote to handle the special regex characters. Suppose you have this DF:

val df = Seq(
  ("hel?.o"),
  ("bbhel?.o"),
  ("hel?.oZZ"),
  ("bye")
).toDF("weird_string")

df.show()
+------------+
|weird_string|
+------------+
|      hel?.o|
|    bbhel?.o|
|    hel?.oZZ|
|         bye|
+------------+

Here's how to find all the strings that contain "hel?.o".

import java.util.regex.Pattern

df
  .withColumn("has_hello", $"weird_string".rlike(Pattern.quote("hel?.o")))
  .show()
+------------+---------+
|weird_string|has_hello|
+------------+---------+
|      hel?.o|     true|
|    bbhel?.o|     true|
|    hel?.oZZ|     true|
|         bye|    false|
+------------+---------+

You could also add the quote characters manually to get the same result:

df
  .withColumn("has_hello", $"weird_string".rlike("""\Qhel?.o\E"""))
  .show()

If you don't properly escape the regex, you won't get the right result:

df
  .withColumn("has_hello", $"weird_string".rlike("hel?.o"))
  .show()
+------------+---------+
|weird_string|has_hello|
+------------+---------+
|      hel?.o|    false|
|    bbhel?.o|    false|
|    hel?.oZZ|    false|
|         bye|    false|
+------------+---------+

See this post for more details.

Powers
  • 18,150
  • 10
  • 103
  • 108