
I have a column in a SparkDataFrame containing timestamps in the following way:

    Start_1
    <chr>
    2016/01/01 10:51:15.304 
    2016/01/01 10:51:15.352

I let Spark infer the schema when reading the file, which yields chr as the data type. I know the inference would work without the milliseconds, yielding a proper timestamp column. However, I need the milliseconds as well and therefore want to change the data type to timestamp within the existing SparkDataFrame.
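For reference, a quick way to confirm what Spark inferred (a minimal sketch; the SparkDataFrame name dataloan matches the snippets below):

    # sketch: check the inferred schema -- Start_1 comes back as string, not timestamp
    printSchema(dataloan)
    head(select(dataloan, "Start_1"))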

Here are the ways I have tested:

as.POSIXct, which would work in base R:

    dataloan_time$start_ts <- as.POSIXct(dataloan$Start_1, format = "%Y/%m/%d %H:%M:%OS")

This doesn't work (it doesn't change the class).

A solution mentioned here and on other sites is casting:

    dataloan_time <- withColumn(dataloan_time, "complete_ts", cast(dataloan$Complete_1, "timestamp"))

For me, it casts the data type correctly but the new column doesn't contain any data.

Here's the result of head() on the new column:

    start_ts
    <lgl>
    NA  
    NA
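One thing I suspect, but haven't verified, is that the cast returns NA simply because Spark's default string-to-timestamp cast expects the yyyy-MM-dd HH:mm:ss[.SSS] form rather than slashes. If that's right, normalising the separators first might be enough (a sketch only, using the same column names as above):

    # untested sketch: swap the slashes for dashes so the default cast can parse it
    dataloan_time <- withColumn(dataloan_time, "start_ts",
                                cast(regexp_replace(dataloan_time$Start_1, "/", "-"), "timestamp"))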

Collecting the data frame and converting it locally would be my last resort, but I'd like to avoid that and do it within the SparkDataFrame. What other solutions are there? Ideally something like the first attempt (as.POSIXct).

Roman

1 Answer


I jumped on this because I thought it might be something simple; I don't think it's that simple anymore :) Environment: Spark 1.5.2

    # setup; probably a better way to do this -- R isn't my first language
    d <- c("2016/01/01 10:51:15.304", "2016/01/01 10:51:15.352")
    df1 <- data.frame(d)
    df <- createDataFrame(sqlContext, df1)

start of my confusion:

    > as.numeric(as.POSIXct(d))
    [1] 1451663475 1451663475

Why does it print the same number? They are different!!

    > print(as.numeric(as.POSIXct(d[1])),digits=20)
    [1] 1451663475.3039999008
    > print(as.numeric(as.POSIXct(d[2])),digits=20)
    [1] 1451663475.3519999981

Okay, so the values actually differ and are just not being printed to my liking (How to parse milliseconds in R? led me down the printing path).
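For the record, base R will print the fractional seconds if you ask it to; a small sketch (the last digit may differ because of floating-point truncation):

    # show 3 digits of fractional seconds when printing POSIXct values
    options(digits.secs = 3)
    as.POSIXct(d)
    # or format a single value explicitly with %OS3
    format(as.POSIXct(d[1]), "%Y-%m-%d %H:%M:%OS3")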

From Writing R data frames returned from SparkR:::map I found a way to write a UDF:

    rows <- SparkR:::flatMap(df, function(x) {
      data <- as.list(x)
      # append the parsed POSIXct (which travels as a double) to the original fields
      datanew <- append(data, as.POSIXct(data[[1]]))
      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
      do.call(mapply, append(args, datanew))
    })
    sdf <- createDataFrame(sqlContext, rows, c("origdt", "rownum", "newdt"))

If we now do a normal print of the DataFrame, we aren't going to like what we see:

    head(sdf)
                       origdt rownum      newdt
    1 2016/01/01 10:51:15.304      1 1451663475
    2 2016/01/01 10:51:15.352      2 1451663475

    # register it so we can run SQL
    registerTempTable(sdf, "sdf")

However, if we order by newdt we can see that the values are being converted correctly:

    head(sql(sqlContext,"select * from sdf order by newdt"))
                       origdt rownum      newdt
    1 2016/01/01 10:51:15.304      1 1451663475
    2 2016/01/01 10:51:15.352      2 1451663475

    head(sql(sqlContext,"select * from sdf order by newdt desc"))
                       origdt rownum      newdt
    1 2016/01/01 10:51:15.352      2 1451663475
    2 2016/01/01 10:51:15.304      1 1451663475

We can also look at the underlying rdd data to confirm that the conversions happened as we expected:

    print(take(rows,1)[[1]][[3]],digits=20)
    [1] 1451663475.3039999008

Note: I've kept the values as doubles the whole time because I assume the end goal is to do datediff-style arithmetic anyway. If you need a true timestamp representation, I would think you could alter the UDF to store it as such.
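For instance, since the doubles are just seconds since the epoch, a difference in (fractional) seconds can be computed straight in SQL against the temp table registered above; an illustrative query, not something from the original post:

    # difference between the two timestamps, in seconds (sub-second part preserved)
    head(sql(sqlContext, "select max(newdt) - min(newdt) as diff_seconds from sdf"))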

James Tobin
  • First of all, thanks for taking the time for this great solution! Unfortunately, I use Spark 2.1.0, and after executing the command to create "rows" I use createDataFrame as you did (with sqlContext). However, the resulting data frame only has the first two columns; newdt doesn't show up ... – Roman Apr 13 '17 at 08:42
  • Well, that's no fun :( I don't have a readily available Spark 2.1.0 environment, nor is R my first language of choice, so that makes proceeding difficult. Sorry :( – James Tobin Apr 13 '17 at 12:53