0

What is the best way to apply the following transformation to a column in a dataframe in Spark 2.40 Scala. I was attempting udf or regex replace but could not achieve the desired outcome.

column_1 in the below example is a string.

Data frame Input:

column_1
#;#;Runner#;#;bob
#;#;#;

Desired Result

column_1
null#;null#;Runner#;null#;bob
null#;null#;null

Solution:

Following the suggestions below. This was how I resolved the issue. In this case I replaced with commas first then later replaced the comma delimiter with #;

select_df
  .withColumn("column_1", regexp_replace(col("column_1"), "(?<![a-zAZ0-9]),", "null,"))
  .withColumn("column_1", regexp_replace(col("column_1"), ",]$", ",null"))
  .withColumn("column_1", regexp_replace(col("column_1"), ",", "#;"))
Defcon
  • 807
  • 3
  • 15
  • 36
  • Does this answer your question? [Adding nulls to dataframe output with regexp replace in Spark 2.4](https://stackoverflow.com/questions/68656179/adding-nulls-to-dataframe-output-with-regexp-replace-in-spark-2-4) – werner Aug 06 '21 at 17:25

1 Answers1

1

You can use regexp_replace 2 times to get required output. First, replace #; (only the #; which are not preceeded by Runner using negative look behind regex expression) by null#; and then replace ending #; with empty string.

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val df = List("#;#;Runner#;#;bob", "#;#;#;").toDF(("column_1"))

  df.withColumn("column_1", regexp_replace(
    regexp_replace('column_1, "(?<![a-zA-Z]+)#;", "null#;"),"#;$", "")
  )
    .show(false)
+-----------------------------+
|column_1                     |
+-----------------------------+
|null#;null#;Runner#;null#;bob|
|null#;null#;null             |
+-----------------------------+
Mohana B C
  • 5,021
  • 1
  • 9
  • 28
  • What if the value is dynamic? Runner may not always be present, in the same position. It could be "llama" in a different position as an example – Defcon Aug 06 '21 at 19:19
  • You can change regex pattern like this -(?<![a-zA-Z]+)#; – Mohana B C Aug 06 '21 at 19:24