
I am trying to join two DataFrames with a condition like "Wo" being contained in "Hello World", i.e. dataframe1's col1 contains dataframe2's col1 value.

In HQL, we can use instr(t1.col1,t2.col1)>0

How can I achieve the same condition with DataFrames in Scala? I tried

df1.join(df2,df1("col1").indexOfSlice(df2("col1")) > 0)

But it is throwing the below error:

error: value indexOfSlice is not a member of org.apache.spark.sql.Column

I just want to achieve the below HQL query using DataFrames:

select t1.*,t2.col1 from t1,t2 where instr(t1.col1,t2.col1)>0
sparker
  • First, this is a Spark-specific question and it should be tagged with [tag:apache-spark]. Secondly, it's not very clear what you are doing. Are you actually trying to perform a cross join with a condition? Also, would you mind presenting an MVCE so we can try to help? The error is obvious, though – eliasah Nov 02 '17 at 13:39
  • @eliasah Yes, I am trying to do a cross join. I will add an SQL example to the question... – sparker Nov 02 '17 at 13:41

1 Answer


The following solution was tested with Spark 2.2. You'll need to define a UDF, and you can then use it as a join condition in a where filter:

// +1 makes the result 1-based like HQL's instr (0 when not found),
// so the `> 0` filter also keeps matches at the start of the string
val indexOfSlice_ = (c1: String, c2: String) => c1.indexOfSlice(c2) + 1
val islice = udf(indexOfSlice_)
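One caveat worth noting here: Scala's indexOfSlice is 0-based and returns -1 when the slice is absent, while HQL's instr is 1-based and returns 0, so a match at the very start of a string yields index 0 and would be silently dropped by a bare `> 0` filter. A quick plain-Scala check of the semantics:

```scala
// indexOfSlice is 0-based and returns -1 when the slice is absent:
println("Hello World".indexOfSlice("Wo"))  // 6
println("World".indexOfSlice("Wo"))        // 0  -- a bare `> 0` filter would drop this match
println("Hello".indexOfSlice("Wo"))        // -1

// Adding 1 reproduces instr's 1-based semantics (0 when absent):
println("World".indexOfSlice("Wo") + 1)    // 1
println("Hello".indexOfSlice("Wo") + 1)    // 0
```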

val df10: DataFrame = Seq(("Hello World", 2), ("Foo", 3)).toDF("c1", "c2")
val df20: DataFrame = Seq(("Wo", 2), ("Bar", 3)).toDF("c3", "c4")

df10.crossJoin(df20).where(islice(df10.col("c1"), df20.col("c3")) > 0).show
// +-----------+---+---+---+
// |         c1| c2| c3| c4|
// +-----------+---+---+---+
// |Hello World|  2| Wo|  2|
// +-----------+---+---+---+

PS: Beware! A cross join is an expensive operation, as it yields a Cartesian product.
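As an aside (not part of the original answer), the same condition can also be expressed without a UDF: Column.contains accepts another Column, and the SQL instr function can be invoked directly through expr. A sketch, assuming the df10/df20 frames defined above and a Spark session in scope:

```scala
import org.apache.spark.sql.functions.expr

// Column.contains lifts its argument to a Column, so no UDF is required:
df10.join(df20, df10("c1").contains(df20("c3"))).show

// Or reuse the HQL function verbatim through an SQL expression:
df10.crossJoin(df20).where(expr("instr(c1, c3) > 0")).show
```

Both variants still imply a Cartesian product under the hood, so the cross-join cost warning above applies equally here.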

EDIT: Consider reading this when you want to use this solution.

eliasah