How to find position of substring in another column of dataframe using spark scala

Question

I have a Spark scala DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext in text column?

Input data:

+---------------------------+---------+
|           text            | subtext | 
+---------------------------+---------+
| Where is my string?       | is      |
| Hm, this one is different | on      |
+---------------------------+---------+

Expected output:


+---------------------------+---------+----------+
|           text            | subtext | position |
+---------------------------+---------+----------+
| Where is my string?       | is      |       6  |
| Hm, this one is different | on      |       9  |
+---------------------------+---------+----------+

Note: I can do this using static text/regex without issue, I have not been able to find any resources on doing this with a row-specific text/regex. Found an answer here that works with pyspark. I am looking to use similar solution in scala. How to find position of substring column in a another column using PySpark?

Did you resolve your issues based on answer? – thebluephantom Apr 30 '22 at 08:20 — thebluephantom, Apr 30 '22 at 08:20

thebluephantom · Answer 1 · 2022-04-30T14:44:45.600

This works:

import org.apache.spark.sql.functions._ 

val df = Seq(
  ("beatles", "hey jude"),
  ("romeo", "eres mia")
).toDF("name", "hit_songs")

val df2 = df.withColumn("answer", locate("ju", col("hit_songs"), pos=1))
df2.show(false)

Gets first find position from or equal to the pos you specify, not all occurrences.

UPDATE

You can adapt to use col as opposed to literal, that goes without saying. However, the stuff on Columns and using locate etc. does not work well. E.g.

val df2 = df.withColumn("answer", locate(col("search_string"), col("hit_songs"), pos=1))

this does not work. I find this aspect hard to explain, but I cannot cast col search_string to string as we have this Column aspect to consider.

So, this is what you need: a UDF reverting to Scala functions:

import org.apache.spark.sql.functions._  
import org.apache.spark.sql.Column 
import spark.implicits._
 
val df = Seq(("beatles", "hey jude", "ju"), ("romeo", "eres mia", "es") ).toDF("name", "hit_songs", "search_string")
 
def ggg = udf((c1: String, c2: String) => {
   c2.indexOf(c1)  
} )
 
  
df.withColumn("value",ggg(df("search_string") ,df("hit_songs"))).show(false)

Note you may add 1 to result as it starts at 0.

Interesting to note the contrast with this question and answer: Pass dataframe column name as parameter to the function using scala?

In any event:

 df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))

does not work.

Thanks. Yeah, creating an UDF works. I was more interested to know if there's way to use column in locate function. We can pass column in the location function in PySpark. I haven't found a way in Scala yet. — user18913657, May 02 '22 at 14:16
Thanks for taking the time to try these out. Really appreciate it. — user18913657, May 02 '22 at 14:17
No you can't with Scala. Believe me I looked. I find it hard to explain but it is not possible afaics. To do with Column. If u google on SO u will see that. Th3 answer is correct. — thebluephantom, May 02 '22 at 15:27

How to find position of substring in another column of dataframe using spark scala

1 Answers1