1

I have a Spark scala DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext in text column?

Input data:

+---------------------------+---------+
|           text            | subtext | 
+---------------------------+---------+
| Where is my string?       | is      |
| Hm, this one is different | on      |
+---------------------------+---------+

Expected output:


+---------------------------+---------+----------+
|           text            | subtext | position |
+---------------------------+---------+----------+
| Where is my string?       | is      |       6  |
| Hm, this one is different | on      |       9  |
+---------------------------+---------+----------+

Note: I can do this using static text/regex without issue, I have not been able to find any resources on doing this with a row-specific text/regex. Found an answer here that works with pyspark. I am looking to use similar solution in scala. How to find position of substring column in a another column using PySpark?

thebluephantom
  • 16,458
  • 8
  • 40
  • 83

1 Answers1

0

This works:

import org.apache.spark.sql.functions._ 

val df = Seq(
  ("beatles", "hey jude"),
  ("romeo", "eres mia")
).toDF("name", "hit_songs")

val df2 = df.withColumn("answer", locate("ju", col("hit_songs"), pos=1))
df2.show(false)

Gets first find position from or equal to the pos you specify, not all occurrences.

UPDATE

You can adapt to use col as opposed to literal, that goes without saying. However, the stuff on Columns and using locate etc. does not work well. E.g.

val df2 = df.withColumn("answer", locate(col("search_string"), col("hit_songs"), pos=1))

this does not work. I find this aspect hard to explain, but I cannot cast col search_string to string as we have this Column aspect to consider.

So, this is what you need: a UDF reverting to Scala functions:

import org.apache.spark.sql.functions._  
import org.apache.spark.sql.Column 
import spark.implicits._
 
val df = Seq(("beatles", "hey jude", "ju"), ("romeo", "eres mia", "es") ).toDF("name", "hit_songs", "search_string")
 
def ggg = udf((c1: String, c2: String) => {
   c2.indexOf(c1)  
} )
 
  
df.withColumn("value",ggg(df("search_string") ,df("hit_songs"))).show(false)

Note you may add 1 to result as it starts at 0.

Interesting to note the contrast with this question and answer: Pass dataframe column name as parameter to the function using scala?

In any event:

 df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))

does not work.

thebluephantom
  • 16,458
  • 8
  • 40
  • 83