
I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext within the text column?

Input data:

+---------------------------+---------+
|           text            | subtext | 
+---------------------------+---------+
| Where is my string?       | is      |
| Hm, this one is different | on      |
+---------------------------+---------+

Expected output:

+---------------------------+---------+----------+
|           text            | subtext | position |
+---------------------------+---------+----------+
| Where is my string?       | is      |       6  |
| Hm, this one is different | on      |       9  |
+---------------------------+---------+----------+

Note: I can do this with a static string/regex without issue; I have not been able to find any resources on doing this with a row-specific string/regex.
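For reference, the sample data can be reproduced with something like the following (the variable name df and the SparkSession setup are assumptions for illustration, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# two columns: the full text, and the subtext guaranteed to occur within it
df = spark.createDataFrame(
    [("Where is my string?", "is"),
     ("Hm, this one is different", "on")],
    ["text", "subtext"],
)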


4 Answers


You can use locate. You need to subtract 1 because string indexes in Spark SQL start from 1, not 0.

import pyspark.sql.functions as F

df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))

df2.show(truncate=False)
+-------------------------+-------+--------+
|text                     |subtext|position|
+-------------------------+-------+--------+
|Where is my string?      |is     |6       |
|Hm, this one is different|on     |9       |
+-------------------------+-------+--------+
– mck
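Note that locate returns 0 when the substring is absent, so after the - 1 shift a non-matching row would show -1; the question's guarantee that subtext always occurs makes this a non-issue here.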

Another way, using the position SQL function:

from pyspark.sql.functions import expr

df1 = df.withColumn('position', expr("position(subtext in text) -1"))

df1.show(truncate=False)

#+-------------------------+-------+--------+
#|text                     |subtext|position|
#+-------------------------+-------+--------+
#|Where is my string?      |is     |6       |
#|Hm, this one is different|on     |9       |
#+-------------------------+-------+--------+
– blackbishop
    This is as correct as mck's answer, but I've opted to give him the check mark because he answered before you. The bit I was missing that appears in both solutions is the use of "expr" – N. P. Jan 21 '21 at 19:32
pyspark.sql.functions.instr(str, substr)

Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.

import pyspark.sql.functions as F
df.withColumn('pos', F.instr(df["text"], df["subtext"]))
– nobody
    `pyspark.sql.functions.instr` expects a string as its second argument; it must be used in `expr` to pass a column. Also, the index returned is 1-based, and the OP wants 0-based. – blackbishop Jan 21 '21 at 16:45
  • I am trying your approach like below: `display(df.withColumn("columnname", substring(col("columnname"), 0, instr(col("columnname"), ","))))`, but I am getting a "TypeError: Column is not iterable" error. Can you please help? – Dev Anand Feb 13 '23 at 07:31
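Following blackbishop's comment above, a minimal sketch of the corrected call: wrap the SQL instr function in expr so both arguments can be column references, and subtract 1 for a 0-based index.

import pyspark.sql.functions as F

# instr(text, subtext) inside expr accepts column references;
# it returns a 1-based index (0 if absent), so subtract 1
df.withColumn('position', F.expr('instr(text, subtext) - 1'))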

You can use locate itself. The problem is that the first parameter of locate (substr) must be a string literal when calling pyspark.sql.functions.locate directly.

So you can use the expr function instead, which lets you reference both columns in a SQL expression:

import pyspark.sql.functions as F

# the third argument (1) is the 1-based position to start searching from;
# subtract 1 to match the question's 0-based expected output
df = input_df.withColumn("poss", F.expr("locate(subtext, text, 1) - 1"))
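With the sample data from the question, this produces 6 and 9, matching the expected output (assuming input_df is the same DataFrame called df above).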