
I would like to transform a Spark DataFrame column from its value in hours, minutes, and seconds into a total number of seconds.

E.g "01:12:17.8370000"

Would become 4337 s thanks for the comment.

or "00:00:39.0390000"

would become 39 s.
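
(For reference, the first value works out to 1 × 3600 + 12 × 60 + 17 = 4337, with the fractional part of the seconds dropped.)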

I have read this question, but I am lost on how to use its code to transform my Spark DataFrame column:

Convert HH:mm:ss in seconds

Something like this:

df.withColumn("duration",col("duration")....)

I am using Scala 2.10.5 and Spark 1.6.

Thank you

javadev

3 Answers


Assuming the column "duration" contains the duration as a string, you can just use the `unix_timestamp` function from the `functions` package to get the number of seconds, passing the pattern:

import org.apache.spark.sql.functions._
import sqlContext.implicits._ // Spark 1.6; on Spark 2.x, import spark.implicits._ instead

val df = Seq("01:12:17.8370000", "00:00:39.0390000").toDF("duration")

// unix_timestamp parses the string with the given pattern into epoch seconds;
// for an HH:mm:ss value that equals the duration in seconds (see the timezone
// caveat in the comments below)
val newColumn = unix_timestamp(col("duration"), "HH:mm:ss")
val result = df.withColumn("duration", newColumn)
result.show

+--------+
|duration|
+--------+
|    4337|
|      39|
+--------+
Daniel de Paula
  • Thanks for the response. One question: will this keep the ordering of the rows? For example, if row 1 has a time of 20 minutes, e.g. "00:02:00.00000", when I add the column back in, will it always end up in the same row with your solution? – javadev Jul 31 '17 at 14:42
  • I ended up accepting the other answer. This is the first time I asked a question here and got two very good answers. Thanks, I just found his easier to use, but thanks for your edit and your answer, really appreciate it :). – javadev Jul 31 '17 at 14:50
  • @gimp770 I found an even easier solution and updated my answer. Please note that using UDFs (as in the other answer) is not always recommended as Spark is not able to optimize the performance. Using the built-in functions found in the `sql.functions` package should be preferred whenever possible. – Daniel de Paula Jul 31 '17 at 14:54
  • be careful as `unix_timestamp()` assumes the session's timezone, so if your session is not in UTC you'll get a wrong duration – gevra Apr 11 '22 at 21:27
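
Following up on gevra's caveat, a minimal sketch of pinning the timezone to UTC so the parsed epoch seconds equal the duration. Note that `spark.sql.session.timeZone` only exists on Spark 2.2+; on Spark 1.6, which the asker uses, `unix_timestamp` falls back to the JVM default timezone:

// Spark 2.2+: make unix_timestamp parse relative to UTC midnight
spark.conf.set("spark.sql.session.timeZone", "UTC")

// Spark 1.6: pin the JVM default timezone instead
java.util.TimeZone.setDefault(java.util.TimeZone.getTimeZone("UTC"))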

If you have a string column, you can write a UDF to calculate this manually:

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._ // for toDF and the $ column syntax on Spark 1.6

val df = Seq("01:12:17.8370000", "00:00:39.0390000").toDF("duration")

// split "HH:mm:ss.fffffff" on ":" and sum the parts as seconds,
// truncating the fractional second part
val str_sec = udf((s: String) => {
    val Array(hour, minute, second) = s.split(":")
    hour.toInt * 3600 + minute.toInt * 60 + second.toDouble.toInt
})

df.withColumn("duration", str_sec($"duration")).show
+--------+
|duration|
+--------+    
|    4337|
|      39|
+--------+
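
If your real data may contain nulls or malformed values, here is a minimal null-safe sketch of the same UDF (an assumption that you want null in the result rather than a runtime exception for bad rows):

import scala.util.Try
import org.apache.spark.sql.functions.udf

// returns None (null in the DataFrame) when the value is missing or malformed
val str_sec_safe = udf((s: String) => Try {
    val Array(hour, minute, second) = s.split(":")
    hour.toInt * 3600 + minute.toInt * 60 + second.toDouble.toInt
}.toOption)

df.withColumn("duration", str_sec_safe($"duration")).show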
Psidom

There are built-in functions you can take advantage of, which are faster and more efficient than udf functions.

Given the input dataframe as

+----------------+
|duration        |
+----------------+
|01:12:17.8370000|
|00:00:39.0390000|
+----------------+

you can do something like below:

df.withColumn("seconds", hour($"duration")*3600+minute($"duration")*60+second($"duration"))

You should get the output as:

+----------------+-------+
|duration        |seconds|
+----------------+-------+
|01:12:17.8370000|4337   |
|00:00:39.0390000|39     |
+----------------+-------+
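
Note that `second($"duration")` truncates the fractional part (e.g. `.8370000`). If you ever need sub-second precision, a sketch using the built-in `split` function instead, assuming the fixed "HH:mm:ss.fffffff" layout shown above:

import org.apache.spark.sql.functions.{split, col}

// parse the three ":"-separated fields and keep the decimal seconds
val parts = split(col("duration"), ":")
df.withColumn("seconds",
  parts.getItem(0).cast("int") * 3600 +
  parts.getItem(1).cast("int") * 60 +
  parts.getItem(2).cast("double")).show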
Ramesh Maharjan