I have a spark dataframe that looks like this:

    +----+------+-------------+
    |user| level|value_pair   |
    +----+------+-------------+
    | A  | 25   |(23.52,25.12)|
    | A  | 6    |(0,0)        |
    | A  | 2    |(11,12.12)   |
    | A  | 32   |(17,16.12)   |
    | B  | 22   |(19,57.12)   |
    | B  | 42   |(10,3.2)     |
    | B  | 43   |(32,21.0)    |
    | C  | 33   |(12,0)       |
    | D  | 32   |(265.21,19.2)|
    | D  | 62   |(57.12,50.12)|
    | D  | 32   |(75.12,57.12)|
    | E  | 63   |(0,0)        |
    +----+------+-------------+

How do I extract the values in the value_pair column and add them to two new columns called value1 and value2, using the comma as the separator?

    +----+------+-------------+-------+
    |user| level|value1       |value2 |
    +----+------+-------------+-------+
    | A  | 25   |23.52        |25.12  |
    | A  | 6    |0            |0      |
    | A  | 2    |11           |12.12  |
    | A  | 32   |17           |16.12  |
    | B  | 22   |19           |57.12  |
    | B  | 42   |10           |3.2    |
    | B  | 43   |32           |21.0   |
    | C  | 33   |12           |0      |
    | D  | 32   |265.21       |19.2   |
    | D  | 62   |57.12        |50.12  |
    | D  | 32   |75.12        |57.12  |
    | E  | 63   |0            |0      |
    +----+------+-------------+-------+

I know I can separate the values like so:

    df = df.withColumn('value1', pyspark.sql.functions.split(df['value_pair'], ',')[0])
    df = df.withColumn('value2', pyspark.sql.functions.split(df['value_pair'], ',')[1])

But how do I also get rid of the parentheses?

  • If I'm not mistaken, [this](https://stackoverflow.com/questions/39235704/split-spark-dataframe-string-column-into-multiple-columns) answer is exactly what you need. – sophocles Apr 20 '21 at 13:41
  • Almost. How do I remove the parentheses? – sampeterson Apr 20 '21 at 15:11
  • @sampeterson Remove parentheses with the `regexp_replace` function, as in this question: https://stackoverflow.com/questions/37038014/pyspark-replace-strings-in-spark-dataframe-column – Emer Apr 20 '21 at 15:25

1 Answer


For the parentheses, as shown in the comments you can use `regexp_replace`, but you need to escape them as `\(` and `\)`: parentheses are special characters in regular expressions, and the backslash `\` is the escape character.

Also, you need to remove the parentheses first, and then split the column.

    from pyspark.sql.functions import split, regexp_replace

    # Strip both parentheses first; the raw strings (r"...") keep the regex
    # escapes from being interpreted as Python string escapes.
    df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\(", ""))
    df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\)", ""))

    df = df.withColumn('value1', split(df['value_pair'], ',').getItem(0)) \
           .withColumn('value2', split(df['value_pair'], ',').getItem(1))

    >>> df.show(truncate=False)

    +----+-----+-----------+------+------+
    |user|level|value_pair |value1|value2|
    +----+-----+-----------+------+------+
    | A  |25   |23.52,25.12|23.52 |25.12 |
    | A  |6    |0,0        |0     |0     |
    | A  |2    |11,12.12   |11    |12.12 |
    | A  |32   |17,16.12   |17    |16.12 |
    | B  |22   |19,57.12   |19    |57.12 |
    | B  |42   |10,3.2     |10    |3.2   |
    | B  |43   |32,21.0    |32    |21.0  |
    | C  |33   |12,0       |12    |0     |
    | D  |32   |265.21,19.2|265.21|19.2  |
    | D  |62   |57.12,50.12|57.12 |50.12 |
    | D  |32   |75.12,57.12|75.12 |57.12 |
    | E  |63   |0,0        |0     |0     |
    +----+-----+-----------+------+------+

Note that I slightly changed how you grab the two items, using `.getItem()` instead of indexing.
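You don't need a Spark session to sanity-check the string logic. Here is a minimal pure-Python sketch of the same two steps (remove the parentheses, then split on the comma), assuming the column values are plain strings like `"(23.52,25.12)"`:

    def split_pair(s):
        # Same two steps as the Spark code: drop the parentheses,
        # then split on the comma.
        cleaned = s.replace("(", "").replace(")", "")
        v1, v2 = cleaned.split(",")
        return v1, v2

    print(split_pair("(23.52,25.12)"))  # ('23.52', '25.12')

Also keep in mind that `value1` and `value2` are still string columns after the split; cast them (e.g. to double) if you want to do arithmetic on them.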

More information can be found here
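As an alternative sketch, a single `regexp_extract` per column would avoid the intermediate replace step, e.g. `regexp_extract(df['value_pair'], r"\(([^,]+),([^)]+)\)", 1)` for value1 and group `2` for value2 (assuming every value follows the `(x,y)` shape). The pattern itself can be checked with plain Python, since Spark's `regexp_extract` uses the same escaped-parenthesis syntax:

    import re

    # Escaped literal parentheses around two capture groups, one per value.
    PAIR = re.compile(r"\(([^,]+),([^)]+)\)")

    for raw in ["(23.52,25.12)", "(0,0)", "(265.21,19.2)"]:
        m = PAIR.fullmatch(raw)
        print(m.group(1), m.group(2))  # e.g. 23.52 25.12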
