
I am using pyspark and regex_extract to create a new column:

df.withColumn("go", F.regexp_extract("fields", '"go":"([A-Za-z0-9]*)"', 1))

"fields" is a column with dictuinary values. The value in it looks like:

{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}

The thing is that there are two "go" keys in "fields". Using the above code, it returns the first value ("NEW123"). I only want the second one's value to be returned (so I want "BYE89" to be returned). How can I do that here?

Thanks!

LLL

  • Try `'.*"go":"([A-Za-z0-9]*)"'` – Wiktor Stribiżew Aug 01 '20 at 19:30
  • @WiktorStribiżew You have to write this as an answer. – Rahul K P Aug 01 '20 at 19:31
  • You may want to take a look at the `object_pairs_hook` parameter of the `json.JSONDecoder` class -- https://stackoverflow.com/a/29322077 -- that way you can avoid using regexes, since you are dealing with valid JSON anyway. – Daniel F Aug 01 '20 at 19:49
  • I think this is an overzealous duplicate mark, since _last in a string_ is a relative term that has many methods to check for, and the OP stated 2nd, not last. If it's last, just use `(?<="go":")[^"]*(?="})` – Aug 01 '20 at 20:33
  • If it's 2nd, use @Shu's way. – Aug 01 '20 at 20:34
  • Well, in this case the 2nd happens to be the last one. But thank you all for the answers. – LLL Aug 01 '20 at 21:21
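Following up on Daniel F's comment: since the value is valid JSON, the `object_pairs_hook` parameter of `json.loads` can collect every key/value pair instead of letting duplicate keys collapse. A minimal plain-Python sketch (outside of Spark; the variable names are illustrative):

```python
import json

raw = '{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}'

# Returning the pairs unchanged keeps both "go" entries,
# instead of the default dict behaviour of keeping only the last one.
pairs = json.loads(raw, object_pairs_hook=lambda p: p)

go_values = [v for k, v in pairs if k == "go"]
print(go_values[1])  # second occurrence: BYE89
```

In PySpark this logic could be wrapped in a UDF, at the usual UDF performance cost.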

1 Answer


Try with "go".*?"go":"(.*)" regex.

df.withColumn("go",regexp_extract(col("fields"),'"go".*?"go":"(.*)"',1)).show(10,False)
df.withColumn("go",regexp_extract(col("fields"),'"go".*?"go":"([A-Za-z0-9]*)"',1)).show(10,False)
#+--------------------------------------------+-----+
#|fields                                      |go   |
#+--------------------------------------------+-----+
#|{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}|BYE89|
#+--------------------------------------------+-----+
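Outside of Spark, the same pattern can be sanity-checked with Python's `re` module, since `regexp_extract` uses the same basic syntax here:

```python
import re

raw = '{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}'

# Lazy .*? advances from the first "go" to the second one,
# so group(1) holds the second value.
m = re.search(r'"go".*?"go":"([A-Za-z0-9]*)"', raw)
print(m.group(1))  # BYE89
```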

Another way would be using the `from_json` function:

The second occurrence of `go` overwrites the first (the same as in a Python dict), so after parsing we are left with only one value for `go`.

df.show(10,False)
#+--------------------------------------------+
#|fields                                      |
#+--------------------------------------------+
#|{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}|
#+--------------------------------------------+

from pyspark.sql.types import *
from pyspark.sql.functions import *

sch = StructType([StructField("go", StringType()), StructField("hey", StringType())])

df.withColumn("go", from_json(col("fields"), sch)).\
    withColumn("go", col("go.go")).show(10, False)
#+--------------------------------------------+-----+
#|fields                                      |go   |
#+--------------------------------------------+-----+
#|{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}|BYE89|
#+--------------------------------------------+-----+
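The overwrite behaviour this relies on is easy to verify in plain Python: `json.loads` also keeps only the last value for a duplicated key.

```python
import json

# Duplicate "go" keys: the parser keeps the last one it sees.
d = json.loads('{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}')
print(d["go"])  # BYE89
```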
notNull