I have a DataFrame column that contains text like this:

col

    0     abc-text1
    1     def_text2-

With PySpark, if a value in col starts with 'abc-' I would like to replace it with just 'abc', and if it starts with 'def_' I would like to replace it with 'def'.

I would like to create a function for this. I am fairly new to Python and PySpark, so I need help with this.

Pankaj Kaundal

2 Answers

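Assuming your column is named col1 and your DataFrame is df:

from pyspark.sql.functions import col, regexp_replace
# "^abc-.*" matches a value that starts with "abc-"; drop the ".*" to rewrite only the prefix
df = df.withColumn('col1', regexp_replace(col('col1'), "^abc-.*", "abc"))
df = df.withColumn('col1', regexp_replace(col('col1'), "^def_.*", "def"))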

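You can use regular expressions (example here: Regular Expression to match string starting with "stop") to replace values starting with 'abc-' or 'def_' with 'abc' or 'def' respectively. The ^ anchors the pattern to the start of the string, so matches elsewhere in the value are left alone.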
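
As a quick local check of what anchored patterns like these do, here is a minimal sketch that uses Python's built-in re module rather than Spark. It assumes Spark's Java-based regexp_replace treats these simple patterns the same way, and the sample values (in particular the third one) are made up for illustration.

```python
import re

# Sample values; the third one is invented to show the effect of the ^ anchor.
samples = ["abc-text1", "def_text2-", "xabc-text3"]

# "^abc-" only matches at the start, so just a leading "abc-" is rewritten.
prefix_only = [re.sub(r"^abc-", "abc", s) for s in samples]

# Adding ".*" extends the match over the rest of the value,
# so the whole string is replaced by "abc".
whole_value = [re.sub(r"^abc-.*", "abc", s) for s in samples]

print(prefix_only)  # ['abctext1', 'def_text2-', 'xabc-text3']
print(whole_value)  # ['abc', 'def_text2-', 'xabc-text3']
```

Which of the two patterns is right depends on whether only the prefix or the entire value should be replaced; the question can be read either way.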

Rob

You can use PySpark's regexp_replace for this; see the code below.

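from pyspark.sql import functions as F

# This removes every "-" and "_" in the value, not only the ones in the prefix
# sqlContext is assumed to exist already, as it does in the PySpark shell
data = [(1,"abc-text1"), (2,"def_text1-")]
df = sqlContext.createDataFrame(data, ["a","b"])
dfe = df.withColumn("b_1", F.regexp_replace(F.col("b"), "(-|_)", ""))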
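
Since the question asks for a reusable function, here is a minimal sketch of one. It assumes the goal is to replace the whole value when it starts with a given prefix. The names PREFIX_MAP, prefix_patterns and replace_by_prefix are hypothetical, and it also assumes that the output of re.escape is accepted by Spark's Java regex engine for simple prefixes like these.

```python
import re

# Hypothetical mapping taken from the question: prefix -> replacement value.
PREFIX_MAP = {"abc-": "abc", "def_": "def"}


def prefix_patterns(prefix_map):
    """Build (anchored regex, replacement) pairs, one per prefix."""
    # re.escape guards against prefixes that contain regex metacharacters;
    # the trailing ".*" extends the match over the rest of the line.
    return [("^" + re.escape(prefix) + ".*", replacement)
            for prefix, replacement in prefix_map.items()]


def replace_by_prefix(df, column, prefix_map=PREFIX_MAP):
    """Return df with `column` rewritten according to prefix_map."""
    # Imported here so prefix_patterns can be tried without a Spark install.
    from pyspark.sql import functions as F

    expr = F.col(column)
    for pattern, replacement in prefix_patterns(prefix_map):
        expr = F.regexp_replace(expr, pattern, replacement)
    return df.withColumn(column, expr)
```

Usage would look like df = replace_by_prefix(df, "b"). Values that match no prefix are left unchanged. Because the replacements are applied in sequence, a replacement that itself starts with a later prefix would be rewritten again, so keep the mapping free of such overlaps.
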
fathomson