
I imported a PostgreSQL table into Spark as a dataframe using Scala. The dataframe looks like this:

user_id | log_dt
--------|---------------------------
96      | 2004-10-19 10:23:54.0
1020    | 2017-01-12 12:12:14.931652

I am transforming this dataframe so that the date format of log_dt is yyyy-MM-dd hh:mm:ss.SSSSSS. To achieve this, I used the following code to convert log_dt to a timestamp using the unix_timestamp function.

val tablereader1 = tablereader1Df.withColumn("log_dt", unix_timestamp(tablereader1Df("log_dt"), "yyyy-MM-dd hh:mm:ss.SSSSSS").cast("timestamp"))
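// Note: unix_timestamp returns whole seconds since the epoch (as a Long),
// so the cast back to timestamp cannot restore the fractional part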

When I print the tablereader1 dataframe using the command tablereader1.show(), I get the following result:

user_id | log_dt
--------|----------------------
96      | 2004-10-19 10:23:54.0
1020    | 2017-01-12 12:12:14.0

How can I retain the microseconds as part of the timestamp? Any suggestions are appreciated.


1 Answer


Milliseconds with date_format()

You can use Spark SQL's date_format(), which accepts Java SimpleDateFormat patterns. Note that SimpleDateFormat can parse fractional seconds only down to milliseconds, with the pattern letter "S".

import org.apache.spark.sql.functions._
import spark.implicits._ // to use $-notation on columns

// Pattern "S" alone returns only the millisecond component as a string
val df = tablereader1Df.withColumn("log_dt", date_format($"log_dt", "S"))
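
As a minimal sketch of that millisecond ceiling (the sample dataframe below is hypothetical, mirroring the question's rows, and assumes the imports above plus a SparkSession named spark):

val sample = Seq(
  (96, "2004-10-19 10:23:54.0"),
  (1020, "2017-01-12 12:12:14.931652")
).toDF("user_id", "log_dt")

// A full pattern keeps the whole timestamp but still truncates to milliseconds:
// ".931652" comes out as ".931"
sample
  .withColumn("log_dt_ms", date_format($"log_dt".cast("timestamp"), "yyyy-MM-dd HH:mm:ss.SSS"))
  .show(false)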

Update: Microseconds with LocalDateTime (Java 8)

// Imports
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField

/* // Commented out as per the comment about IntelliJ; see the udf() version below
spark.udf.register("date_microsec", (dt: String) => {
    val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
    LocalDateTime.parse(dt, dtFormatter).getLong(ChronoField.MICRO_OF_SECOND)
})
*/

import org.apache.spark.sql.functions.udf

// Parse the timestamp string and return its microsecond-of-second part.
// Six "S" letters parse exactly six fractional digits as a fraction of a second,
// so "2017-01-12 12:12:14.931652" yields 931652.
val date_microsec = udf((dt: String) => {
    val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
    LocalDateTime.parse(dt, dtFormatter).getLong(ChronoField.MICRO_OF_SECOND)
})

Check the DateTimeFormatter documentation for help in building the pattern.

To fetch nanoseconds in the UDF, use ChronoField.NANO_OF_SECOND instead of ChronoField.MICRO_OF_SECOND.

val df = tablereader1Df.withColumn("log_date_microsec", date_microsec($"log_dt"))
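
For example, applied to the question's second row (a sketch: the hypothetical sample column is a plain string carrying exactly six fractional digits, as the pattern above expects):

val sample = Seq((1020, "2017-01-12 12:12:14.931652")).toDF("user_id", "log_dt")
sample.withColumn("log_date_microsec", date_microsec($"log_dt")).show(false)
// log_date_microsec: 931652
// (with ChronoField.NANO_OF_SECOND it would be 931652000)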
  • I used `val df = tablereader1Df.withColumn("log_dt", date_format(tablereader1Df("log_dt"), "S"))` instead. It worked. Thanks. – Sid Jan 26 '17 at 20:10
  • One concern I have here is that the date_format() function only gives me millisecond precision. Is there a way I can get microsecond precision? – Sid Jan 26 '17 at 20:26
  • `import spark.implicits._` will resolve the issue of accessing columns directly with **$**. Please check the update. – mrsrinivas Jan 27 '17 at 04:20
  • @mrsrinivas This still does not handle microseconds, unfortunately. It can't be done with SimpleDateFormat, and java.util.Date doesn't even store that level of precision. – puhlen Jun 09 '17 at 14:12
  • @puhlen: Update! Included a solution for micro/nanoseconds using Java 8. – mrsrinivas Jul 07 '17 at 19:37
  • IntelliJ IDEA cannot find `date_microsec` :( How do I add it? – Cherry Jul 13 '17 at 04:51