
I am using pyspark and regex_extract to create a new column:

df.withColumn("go", F.regexp_extract("fields", '"go":"([A-Za-z0-9]*)"', 1))

"fields" is a column with dictuinary values. The value in it looks like:

{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}

The thing is that there are two "go" keys in "fields". Using the above code, it returns the first value ("NEW123"). I only want the second one's value to be returned (so I want "BYE89" to be returned). How can I do that here?

Thanks!

LLL

  • Try `'.*"go":"([A-Za-z0-9]*)"'` – Wiktor Stribiżew Aug 01 '20 at 19:30
  • @WiktorStribiżew You have to write this as an answer. – Rahul K P Aug 01 '20 at 19:31
  • You may want to take a look at the `object_pairs_hook` parameter of the `json.JSONDecoder` class -- https://stackoverflow.com/a/29322077 -- that way you can avoid using regexes, since you are dealing with valid JSON anyway. – Daniel F Aug 01 '20 at 19:49
  • I think this is an overzealous duplicate mark, since _last in a string_ is a relative term that has many methods to check for, and the OP stated 2nd, not last. If it's last, just use `(?<="go":")[^"]*(?="})` – Aug 01 '20 at 20:33
  • If it's 2nd, use @Shu's way. – Aug 01 '20 at 20:34
  • Well, in this case the 2nd happens to be the last one. But thank you all for the answers. – LLL Aug 01 '20 at 21:21
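Following up on Daniel F's comment: since the value is valid JSON, the `object_pairs_hook` parameter of `json.loads` can collect every key/value pair instead of letting duplicate keys collapse. A minimal plain-Python sketch (outside of Spark; the variable names are illustrative):

```python
import json

raw = '{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}'

# Returning the pairs unchanged keeps both "go" entries,
# instead of the default dict behaviour of keeping only the last one.
pairs = json.loads(raw, object_pairs_hook=lambda p: p)

go_values = [v for k, v in pairs if k == "go"]
print(go_values[1])  # second occurrence: BYE89
```

In PySpark this logic could be wrapped in a UDF, at the usual UDF performance cost.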

1 Answer


Try with "go".*?"go":"(.*)" regex.

df.withColumn("go",regexp_extract(col("fields"),'"go".*?"go":"(.*)"',1)).show(10,False)
df.withColumn("go",regexp_extract(col("fields"),'"go".*?"go":"([A-Za-z0-9]*)"',1)).show(10,False)
#+--------------------------------------------+-----+
#|fields                                      |go   |
#+--------------------------------------------+-----+
#|{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}|BYE89|
#+--------------------------------------------+-----+
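Outside of Spark, the same pattern can be sanity-checked with Python's `re` module, since `regexp_extract` uses the same basic syntax here:

```python
import re

raw = '{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}'

# Lazy .*? advances from the first "go" to the second one,
# so group(1) holds the second value.
m = re.search(r'"go".*?"go":"([A-Za-z0-9]*)"', raw)
print(m.group(1))  # BYE89
```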

Another way would be using the `from_json` function:

The second occurrence of `go` overwrites the first (the same as in a Python dict), so after parsing we are left with only one value for `go`.

df.show(10,False)
#+--------------------------------------------+
#|fields                                      |
#+--------------------------------------------+
#|{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}|
#+--------------------------------------------+

from pyspark.sql.types import *
from pyspark.sql.functions import *

sch = StructType([StructField("go", StringType()), StructField("hey", StringType())])

df.withColumn("go", from_json(col("fields"), sch)).\
    withColumn("go", col("go.go")).show(10, False)
#+--------------------------------------------+-----+
#|fields                                      |go   |
#+--------------------------------------------+-----+
#|{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}|BYE89|
#+--------------------------------------------+-----+
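The overwrite behaviour this relies on is easy to verify in plain Python: `json.loads` also keeps only the last value for a duplicated key.

```python
import json

# Duplicate "go" keys: the parser keeps the last one it sees.
d = json.loads('{"go":"NEW123", "hey":"OLD32", "go":"BYE89"}')
print(d["go"])  # BYE89
```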
notNull