
I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext within the text column?

Input data:

+---------------------------+---------+
|           text            | subtext | 
+---------------------------+---------+
| Where is my string?       | is      |
| Hm, this one is different | on      |
+---------------------------+---------+

Expected output:

+---------------------------+---------+----------+
|           text            | subtext | position |
+---------------------------+---------+----------+
| Where is my string?       | is      |       6  |
| Hm, this one is different | on      |       9  |
+---------------------------+---------+----------+

Note: I can do this with a static string/regex without issue; I have not been able to find any resources on doing this with a row-specific string/regex.
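For reference, the sample data can be reproduced with something like the following (the variable name df and the SparkSession setup are assumptions for illustration, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# two columns: the full text, and the subtext guaranteed to occur within it
df = spark.createDataFrame(
    [("Where is my string?", "is"),
     ("Hm, this one is different", "on")],
    ["text", "subtext"],
)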


4 Answers


You can use locate. You need to subtract 1 because string indexes in Spark SQL start from 1, not 0.

import pyspark.sql.functions as F

df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))

df2.show(truncate=False)
+-------------------------+-------+--------+
|text                     |subtext|position|
+-------------------------+-------+--------+
|Where is my string?      |is     |6       |
|Hm, this one is different|on     |9       |
+-------------------------+-------+--------+
– mck
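Note that locate returns 0 when the substring is absent, so after the - 1 shift a non-matching row would show -1; the question's guarantee that subtext always occurs makes this a non-issue here.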

Another way, using the position SQL function:

from pyspark.sql.functions import expr

df1 = df.withColumn('position', expr("position(subtext in text) -1"))

df1.show(truncate=False)

#+-------------------------+-------+--------+
#|text                     |subtext|position|
#+-------------------------+-------+--------+
#|Where is my string?      |is     |6       |
#|Hm, this one is different|on     |9       |
#+-------------------------+-------+--------+
– blackbishop
    This is as correct as mck's answer, but I've opted to give him the check mark because he answered before you. The bit I was missing that appears in both solutions is the use of "expr" – N. P. Jan 21 '21 at 19:32
pyspark.sql.functions.instr(str, substr)

Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.

import pyspark.sql.functions as F
df.withColumn('pos', F.instr(df["text"], df["subtext"]))
– nobody
    `pyspark.sql.functions.instr` expects a string as its second argument; it must be used in `expr` to pass a column. Also, the index returned is 1-based, and the OP wants 0-based. – blackbishop Jan 21 '21 at 16:45
  • I am trying your approach like below: `display(df.withColumn("columnname", substring(col("columnname"), 0, instr(col("columnname"), ","))))`, but I am getting a "TypeError: Column is not iterable" error. Can you please help? – Dev Anand Feb 13 '23 at 07:31
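Following blackbishop's comment above, a minimal sketch of the corrected call: wrap the SQL instr function in expr so both arguments can be column references, and subtract 1 for a 0-based index.

import pyspark.sql.functions as F

# instr(text, subtext) inside expr accepts column references;
# it returns a 1-based index (0 if absent), so subtract 1
df.withColumn('position', F.expr('instr(text, subtext) - 1'))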

You can use locate itself. The problem is that the first parameter of locate (substr) must be a string literal when calling pyspark.sql.functions.locate directly.

So you can use the expr function instead, which lets you reference both columns in a SQL expression:

import pyspark.sql.functions as F

# the third argument (1) is the 1-based position to start searching from;
# subtract 1 to match the question's 0-based expected output
df = input_df.withColumn("poss", F.expr("locate(subtext, text, 1) - 1"))
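With the sample data from the question, this produces 6 and 9, matching the expected output (assuming input_df is the same DataFrame called df above).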