
I am using pyspark version 1.5.2. I have a pyspark dataframe with a column "id" as shown below:

id
------------
000001_128
000123_1_3 
006745_8
000000_9_7

I want to count the number of '_' (underscores) in each row of the DF and perform a when operation such that, if there is only one underscore in the string, '_1' is added as a suffix; otherwise the value is left as it is. So the desired result would be:

id          | new_id
------------------------
000001_128  | 000001_128_1
000123_1_3  | 000123_1_3
006745_8    | 006745_8_1
000000_9_7  | 000000_9_7

I am using pyspark.sql.functions for other operations.
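
For reference, here is a minimal sketch of how the sample DataFrame above could be built (the SQLContext/sc setup is assumed and is not part of my actual pipeline):

from pyspark.sql import SQLContext

# hypothetical setup: recreate the sample "id" column from the question
sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
df = sqlContext.createDataFrame(
    [("000001_128",), ("000123_1_3",), ("006745_8",), ("000000_9_7",)],
    ["id"]
)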

Any help is appreciated!

codecracker

2 Answers

from pyspark.sql.functions import udf

@udf(returnType='string')
def fmt(s):
    # append '_1' only when the string contains exactly one underscore
    return s if s.count('_') != 1 else f'{s}_1'


df.withColumn('new_id', fmt(df.id))
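
If you're on an older Spark release (the question mentions 1.5.2) where the decorator form of udf isn't available, a sketch of the same idea with the classic wrapper (return type given explicitly, no f-string so it also runs on older Python):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def fmt_py(s):
    # append '_1' only when the string contains exactly one underscore
    return s if s.count('_') != 1 else s + '_1'

fmt_udf = udf(fmt_py, StringType())
df.withColumn('new_id', fmt_udf(df.id))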
Arnon Rotem-Gal-Oz

Here's a non-udf approach:

You can count the number of _ in each id by splitting on it (the count is the size of the split minus one), then use pyspark.sql.functions.when() to check whether that count equals 1. If it does, use pyspark.sql.functions.format_string() to build new_id; otherwise leave the column unchanged:

import pyspark.sql.functions as f

df.withColumn(
    "new_id",
    f.when(
        # number of underscores = number of pieces after splitting on "_", minus 1
        (f.size(f.split("id", "_")) - 1) == 1,
        f.format_string("%s_1", f.col("id"))
    ).otherwise(f.col("id"))
).show()
#+----------+------------+
#|        id|      new_id|
#+----------+------------+
#|000001_128|000001_128_1|
#|000123_1_3|  000123_1_3|
#|  006745_8|  006745_8_1|
#|000000_9_7|  000000_9_7|
#+----------+------------+
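
If you prefer to avoid format_string, a sketch of an equivalent built-in-only variant uses f.concat with f.lit (also available as of Spark 1.5):

import pyspark.sql.functions as f

df.withColumn(
    "new_id",
    f.when(
        # same underscore count via split
        (f.size(f.split("id", "_")) - 1) == 1,
        f.concat(f.col("id"), f.lit("_1"))
    ).otherwise(f.col("id"))
)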
pault
  • I'm curious, why would using 4 UDFs (albeit built-in) passing data back and forth between dfs and regular types be better than one udf – Arnon Rotem-Gal-Oz Jul 24 '18 at 19:37
  • @ArnonRotem-Gal-Oz it's important that these are built-ins (not user defined *python* functions) - this means that all of the execution can happen inside the JVM (pyspark is just a wrapper after all). If you want to use a python udf, spark has to serialize the dataframe so that the python code can be executed. **Update**: See [this post](https://stackoverflow.com/a/38297050/5858851) for a more detailed explanation. – pault Jul 24 '18 at 19:43