
In the Spark API:

column.like("only takes a static string with optional wildcards like %")
column.contains(accepts_a_column_but_wont_parse_wildcards)

So what is the equivalent method for comparing values against wildcards that may appear in a string value coming from a column on the other side of the join?

Example that fails, because like() accepts a literal String rather than a Column:

.join(other_df, column.like(concat("%", $"column_potentially_with_wildcards", "%")), "left")

?

Rimer

1 Answer


Looking at the code, like() accepts only a literal String, presumably as a convenience. Hopefully this will be expanded in a future release, but for now you can create your own function to compensate:

import org.apache.spark.sql.catalyst.expressions.Like
import org.apache.spark.sql.Column

// Build the LIKE predicate directly from the catalyst expression, so the
// pattern can itself be a Column rather than a literal String.
def columnLike(a: Column, b: Column): Column = new Column(Like(a.expr, b.expr))

...

scala> val df1 = List("aaaa", "bbbb", "aaaabbbbcccc", "abcd", "abc").toDS()
df1: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val df2 = List("a%b%c").toDS()
df2: org.apache.spark.sql.Dataset[String] = [value: string]

scala> df1.join(df2, columnLike(df1("value"), df2("value"))).show
+------------+-----+                                                            
|       value|value|
+------------+-----+
|aaaabbbbcccc|a%b%c|
|         abc|a%b%c|
+------------+-----+
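
An alternative, hinted at in the comments below, is translating the SQL wildcards into a Java regex. Here is a minimal sketch of such a translator; the helper name likeToRegex is illustrative, not part of the Spark API, and the resulting regex could then be used in an RLIKE expression or similar:

```scala
import java.util.regex.Pattern

// Hypothetical helper: translate a SQL LIKE pattern into a Java regex,
// quoting everything that is not a wildcard so regex metacharacters in
// the data cannot change the match.
def likeToRegex(pattern: String): String =
  pattern.map {
    case '%' => ".*" // SQL '%' matches any run of characters
    case '_' => "."  // SQL '_' matches exactly one character
    case c   => Pattern.quote(c.toString)
  }.mkString

// "a%b%c" becomes a regex matching "aaaabbbbcccc" and "abc",
// mirroring the join result above.
```

Note that SQL LIKE matches the whole string, so if the translated regex is used with RLIKE (which does a substring search) it should be anchored with ^ and $.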


Charlie Flowers
  • nice! I actually solved it with `.join(dfToJoin, expr("col_a LIKE CONCAT('%', fuzzy_matching_col, '%')"))` Why do I not have to register `isLike()` as a UDF to use it? – Rimer Jun 12 '19 at 00:16
  • Ooh, good one - my first instinct was to convert the SQL wildcards in `fuzzy_matching_col` to a Java regex and use `regexp_extract` as a pseudo-like. I'm curious if there are any performance or planning differences. I don't have a ton of experience with Spark UDFs, but from what I've read they seem to translate functions operating on regular language types (`Long`s, etc.) to work on `Column`s. Since my function is `Column`s in, `Column`s out, there's no need. (bonus: my answer as a one-liner: `df1.join(df2, new Column(Like(df1("value").expr, df2("value").expr))).show`) – Charlie Flowers Jun 13 '19 at 22:55