6
+---+------------+
|  A|           B|
+---+------------+
| x1|        [s1]|
| x2|   [s2 (A2)]|
| x3|   [s3 (A3)]|
| x4|   [s4 (A4)]|
| x5|   [s5 (A5)]|
| x6|   [s6 (A6)]|
+---+------------+

The desired result:

+---+------------+-------+
|A  |B           |value  |
+---+------------+-------+
|x1 |[s1]        |[s1]   |
|x2 |[s2 (A2)]   |[s2]   |
|x3 |[s3 (A3)]   |[s3]   |
|x4 |[s4 (A4)]   |[s4]   |
|x5 |[s5 (A5)]   |[s5]   |
|x6 |[s6 (A6)]   |[s6]   |
+---+------------+-------+

When I applied each of the codes below, the parentheses and the whitespace before them were not replaced:

from pyspark.sql.functions import expr
df.withColumn("C",
               expr('''transform(B, x-> regexp_replace(x, ' \\(A.\\)', ''))''')).show(truncate=False)

or

df.withColumn("C",
               expr('''transform(B, x-> regexp_replace(x, ' \(A.\)', ''))''')).show(truncate=False)

The obtained result:

+---+------------+------------+
|A  |B           |value       |
+---+------------+------------+
|x1 |[s1]        |[s1]        |
|x2 |[s2 (A2)]   |[s2 ()]     |
|x3 |[s3 (A3)]   |[s3 ()]     |
|x4 |[s4 (A4)]   |[s4 ()]     |
|x5 |[s5 (A5)]   |[s5 ()]     |
|x6 |[s6 (A6)]   |[s6 ()]     |
+---+------------+------------+
Sadek
  • 127
  • 1
  • 8
  • When giving an example it is almost always helpful to show the *desired result* before moving on to other parts of the question. Here you refer to "replace parentheses" without saying what the replacement is. Your code suggests it is empty strings. In other words, you wish to remove parentheses. (I could be wrong.) Moreover, you haven't said whether you want the formatting of other lines adjusted accordingly. All that would be mute if you merely showed the desired result (before "When I applied..."). – Cary Swoveland Jun 27 '20 at 18:44
  • 1
    is `df.withColumn("C", F.regexp_replace('B', '[\\(\\)]', ''))` what you are looking for? – werner Jun 27 '20 at 18:47
  • @ werner, That is applied on columns with String values, but I want it for array values. Thanks – Sadek Jun 27 '20 at 18:56
  • 1
    @Cary Swoveland, thank you for your comment, I edited the question and showed the desired result – Sadek Jun 27 '20 at 18:59
  • If I understand you correctly, you want to apply `regexp_replace` to each element of the arrays individually. Am I right? – werner Jun 27 '20 at 19:07
  • @werner, Exactly. I added an example of the expected result – Sadek Jun 27 '20 at 19:13
  • try `df.withColumn("value", F.expr("""transform(B,x-> regexp_replace(x,"\ (.*\)",''))"""))` – murtihash Jun 27 '20 at 19:43

2 Answers2

2

You can create a UDF that removes all elements from the array that match the regular expression r"\(.*\)". If necessary, you can change the regexp to match r"\(A.\)" if that is required.

import re
replaced = F.udf(lambda arr: [s for s in arr if not re.compile(r"\(.*\)").match(s)], \
                 T.ArrayType(T.StringType()))
df.withColumn("value", replaced("B")).show()
werner
  • 13,518
  • 6
  • 30
  • 45
1

You can split the array value and get only the first index from the array.

  • (or) using regexp_replace function.

Example:

df.show()
#+---+---------+
#|  A|        B|
#+---+---------+
#| x1|     [s1]|
#| x2|[s2 (A2)]|
#+---+---------+

df.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: array (nullable = true)
# |    |-- element: string (containsNull = true)

df.withColumn("C",expr('''transform(B,x -> split(x,"\\\s+")[0])''')).show()

#using regexp_replace function
df.withColumn("C",expr('''transform(B,x -> regexp_replace(x,"(\\\s+.*)",""))''')).show()
df.withColumn("C",expr('''transform(B,x -> regexp_replace(x,"(\\\s+\\\((?i)A.+\\\))",""))''')).show()
#+---+---------+----+
#|  A|        B|   C|
#+---+---------+----+
#| x1|     [s1]|[s1]|
#| x2|[s2 (A2)]|[s2]|
#+---+---------+----+
notNull
  • 30,258
  • 4
  • 35
  • 50