Replace parentheses in pyspark with replace_regex

Question

+---+------------+
|  A|           B|
+---+------------+
| x1|        [s1]|
| x2|   [s2 (A2)]|
| x3|   [s3 (A3)]|
| x4|   [s4 (A4)]|
| x5|   [s5 (A5)]|
| x6|   [s6 (A6)]|
+---+------------+

The desired result:

+---+------------+-------+
|A  |B           |value  |
+---+------------+-------+
|x1 |[s1]        |[s1]   |
|x2 |[s2 (A2)]   |[s2]   |
|x3 |[s3 (A3)]   |[s3]   |
|x4 |[s4 (A4)]   |[s4]   |
|x5 |[s5 (A5)]   |[s5]   |
|x6 |[s6 (A6)]   |[s6]   |
+---+------------+-------+

When I applied each of the codes below, the parentheses and the whitespace before them were not replaced:

from pyspark.sql.functions import expr
df.withColumn("C",
               expr('''transform(B, x-> regexp_replace(x, ' \\(A.\\)', ''))''')).show(truncate=False)

or

df.withColumn("C",
               expr('''transform(B, x-> regexp_replace(x, ' \(A.\)', ''))''')).show(truncate=False)

The obtained result:

+---+------------+------------+
|A  |B           |value       |
+---+------------+------------+
|x1 |[s1]        |[s1]        |
|x2 |[s2 (A2)]   |[s2 ()]     |
|x3 |[s3 (A3)]   |[s3 ()]     |
|x4 |[s4 (A4)]   |[s4 ()]     |
|x5 |[s5 (A5)]   |[s5 ()]     |
|x6 |[s6 (A6)]   |[s6 ()]     |
+---+------------+------------+

When giving an example it is almost always helpful to show the *desired result* before moving on to other parts of the question. Here you refer to "replace parentheses" without saying what the replacement is. Your code suggests it is empty strings. In other words, you wish to remove parentheses. (I could be wrong.) Moreover, you haven't said whether you want the formatting of other lines adjusted accordingly. All that would be mute if you merely showed the desired result (before "When I applied..."). — Cary Swoveland, Jun 27 '20 at 18:44
is `df.withColumn("C", F.regexp_replace('B', '[\\(\\)]', ''))` what you are looking for? — werner, Jun 27 '20 at 18:47
@ werner, That is applied on columns with String values, but I want it for array values. Thanks — Sadek, Jun 27 '20 at 18:56
@Cary Swoveland, thank you for your comment, I edited the question and showed the desired result — Sadek, Jun 27 '20 at 18:59
If I understand you correctly, you want to apply `regexp_replace` to each element of the arrays individually. Am I right? — werner, Jun 27 '20 at 19:07
try `df.withColumn("value", F.expr("""transform(B,x-> regexp_replace(x,"\ (.*\)",''))"""))` — murtihash, Jun 27 '20 at 19:43

werner · Answer 1 · 2020-06-27T21:37:56.927

2

You can create a UDF that removes all elements from the array that match the regular expression r"\(.*\)". If necessary, you can change the regexp to match r"\(A.\)" if that is required.

import re
replaced = F.udf(lambda arr: [s for s in arr if not re.compile(r"\(.*\)").match(s)], \
                 T.ArrayType(T.StringType()))
df.withColumn("value", replaced("B")).show()

edited Jun 27 '20 at 21:37

answered Jun 27 '20 at 19:32

werner

13,518
6
30
45

notNull · Accepted Answer · 2020-06-27T19:38:47.310

You can split the array value and get only the first index from the array.

(or) using regexp_replace function.

Example:

df.show()
#+---+---------+
#|  A|        B|
#+---+---------+
#| x1|     [s1]|
#| x2|[s2 (A2)]|
#+---+---------+

df.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: array (nullable = true)
# |    |-- element: string (containsNull = true)

df.withColumn("C",expr('''transform(B,x -> split(x,"\\\s+")[0])''')).show()

#using regexp_replace function
df.withColumn("C",expr('''transform(B,x -> regexp_replace(x,"(\\\s+.*)",""))''')).show()
df.withColumn("C",expr('''transform(B,x -> regexp_replace(x,"(\\\s+\\\((?i)A.+\\\))",""))''')).show()
#+---+---------+----+
#|  A|        B|   C|
#+---+---------+----+
#| x1|     [s1]|[s1]|
#| x2|[s2 (A2)]|[s2]|
#+---+---------+----+

Replace parentheses in pyspark with replace_regex

2 Answers2