
I have 3 rows in a DataFrame, and in 2 of them the column id has null values. I need to loop through each row and, for that specific column id, replace the null with an epoch time; each value should be unique, and the replacement should happen in the DataFrame itself. How can this be done? For example:

id   | name
1    | a
null | b
null | c

I want a DataFrame in which the nulls are converted to epoch times:

id      | name
1       | a
1435232 | b
1542344 | c
SRIRAM RAMACHANDRAN

2 Answers

df
  .select(
    when($"id").isNull, /*epoch time*/).otherwise($"id").alias("id"),
    $"name"
  )
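
For a concrete fill-in of that placeholder, a minimal sketch (assuming a spark-shell session with df and spark.implicits._ in scope) is the built-in unix_timestamp(), which returns the current epoch seconds. Note that it evaluates to the same value for every row in the query, which is why the EDIT below moves to a UDF:

import org.apache.spark.sql.functions.{when, unix_timestamp}

// unix_timestamp() with no arguments is the current epoch time in seconds;
// it is constant for the whole query, so it does not give per-row uniqueness.
df.select(
  when($"id".isNull, unix_timestamp()).otherwise($"id").alias("id"),
  $"name"
)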

EDIT

You need to make sure the UDF is precise enough - if it only has millisecond resolution you will see duplicate values. See my example below, which illustrates that the approach works:

scala> def rand(s: String): Double = Math.random
rand: (s: String)Double

scala> val udfF = udf(rand(_: String))
udfF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(StringType)))

scala> res11.select(when($"id".isNull, udfF($"id")).otherwise($"id").alias("id"), $"name").collect
res21: Array[org.apache.spark.sql.Row] = Array([0.6668195187088702,a], [0.920625293516218,b])
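
The random UDF above only demonstrates the pattern. A sketch of an actual epoch-based UDF (my own naming, not from the session above) that reads the wall clock at nanosecond granularity to reduce, though not eliminate, collisions:

import java.time.Instant
import org.apache.spark.sql.functions.{udf, when}

// Epoch time in nanoseconds built from the wall clock. Many JVMs only expose
// microsecond (or coarser) clock resolution, so duplicates are still possible.
val epochNanosUdf = udf((_: String) => {
  val now = Instant.now
  now.getEpochSecond * 1000000000L + now.getNano
})

df.select(
  when($"id".isNull, epochNanosUdf($"name")).otherwise($"id").alias("id"),
  $"name"
).show(false)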
Terry Dactyl

Check this out

scala>  val s1:Seq[(Option[Int],String)] = Seq( (Some(1),"a"), (null,"b"), (null,"c"))
s1: Seq[(Option[Int], String)] = List((Some(1),a), (null,b), (null,c))

scala> val df = s1.toDF("id","name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> val epoch = java.time.Instant.now.getEpochSecond
epoch: Long = 1539084285

scala> df.withColumn("id",when( $"id".isNull,epoch).otherwise($"id")).show
+----------+----+
|        id|name|
+----------+----+
|         1|   a|
|1539084285|   b|
|1539084285|   c|
+----------+----+


scala>

EDIT1:

Even when I used milliseconds, I got the same values. Spark doesn't capture nanoseconds in the time portion, so it is possible for many rows to get the same millisecond. Your assumption of getting unique values based on epoch time would therefore not work.

scala> def getEpoch(x:String):Long = java.time.Instant.now.toEpochMilli
getEpoch: (x: String)Long

scala> val myudfepoch = udf( getEpoch(_:String):Long )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))

scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+-------------+----+
|           id|name|
+-------------+----+
|            1|   a|
|1539087300957|   b|
|1539087300957|   c|
+-------------+----+


scala>

The only other possibility is to use monotonicallyIncreasingId, but those values may not be of the same length all the time.

scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)+monotonicallyIncreasingId).otherwise($"id")).show
warning: there was one deprecation warning; re-run with -deprecation for details
+-------------+----+
|           id|name|
+-------------+----+
|            1|   a|
|1539090186541|   b|
|1539090186543|   c|
+-------------+----+


scala>

EDIT2:

I'm able to trick System.nanoTime into providing increasing ids. They will not be sequential, but the length can be maintained. See below:

scala> def getEpoch(x:String):String = System.nanoTime.toString.take(12)
getEpoch: (x: String)String

scala>  val myudfepoch = udf( getEpoch(_:String):String )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+------------+----+
|          id|name|
+------------+----+
|           1|   a|
|186127230392|   b|
|186127230399|   c|
+------------+----+


scala>

Try this out when running on a cluster, and adjust the take(12) if you get duplicate values.
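
If you want a guarantee rather than a trick, one variant (my own sketch, not part of the answer above) is to keep the epoch seconds but take per-row uniqueness from monotonically_increasing_id(), which Spark guarantees to be unique within a job because it encodes the partition id and the row's position:

import org.apache.spark.sql.functions.{when, lit, monotonically_increasing_id}

// Epoch seconds captured once on the driver; adding a per-row unique value
// keeps the result unique within this job, though the lengths still vary.
val epochSec = java.time.Instant.now.getEpochSecond

df.withColumn("id",
  when($"id".isNull, lit(epochSec) + monotonically_increasing_id())
    .otherwise($"id")
).show(false)

This avoids UDFs entirely, but as noted above the generated values will not all have the same length.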

stack0114106