1

I have written a pyspark code doing following operation but its not working as intended. Can anyone point out my mistake please

# Data cleaning function
def clean_data(data):
    rep = data.replace('/','')
    rep = data.replace('-','')
    rep = data.replace('+','')
    rep = data.replace(' ','')
    return  rep

#clean_data_udf_int = udf(lambda z: clean_data(z), StringType())
#con.show(4)
clean_data_udf = udf(clean_data, StringType())
con = con.withColumn('ph1_f',clean_data_udf('phone1'))

Input dataframe is con:

id phone phone1
1  098     /90
2  + 91    -90

Output i want dataframe is :

id phone phone1
1  98     90
2  91     90
Tilo
  • 409
  • 1
  • 5
  • 14

2 Answers2

3

In this case, you're better off using pyspark.sql.functions.regexp_replace() instead of using a udf.

from pyspark.sql.functions import col, regexp_replace

def clean_data(data):
    rep = regexp_replace(data, "[\/\-\+ ]", '')
    rep = regexp_replace(rep, "^0", '')
    return rep

df = df.select(
    "id", 
    clean_data(col("phone")).alias("phone"),
    clean_data(col("phone1")).alias("phone1")
)
df.show()
#+---+-----+------+
#| id|phone|phone1|
#+---+-----+------+
#|  1|   98|    90|
#|  2|   91|    90|
#+---+-----+------+

Since some of the characters you want to replace have special meaning in regular expressions, they need to be escaped with a \.

The first pattern means:

  • [\/\-\+ ]: Match a single character present in ["/", "-", "+", " "]

The second replace pattern means:

  • ^0: Replace a 0 starting at the beginning of the string. You can change this to ^0+ if you expect there to be multiple leading 0s.
pault
  • 41,343
  • 15
  • 107
  • 149
2

You are reassigning rep each time you use replace, instead of replacing more things inside rep after the first assignment;

def clean_data(data):
    rep = data.replace('/','')

    # Now start replacing more things in 'rep' and not the original 'data'
    rep = rep.replace('-','')  
    rep = rep.replace('+','')
    rep = rep.replace(' ','')

    return  rep

Otherwise on your last call rep = data.replace(' ','') rep is the equivalent of the original data but with spaces removed. The other replace calls have no effect.

Nordle
  • 2,915
  • 3
  • 16
  • 34
  • Thanks, I changed as above, but i still see few instances where "+" is not replaced and still present in records after doing above changes – Tilo May 21 '19 at 12:47