
I have a column in a SparkDataFrame containing timestamps in the following way:

    Start_1
    <chr>
    2016/01/01 10:51:15.304 
    2016/01/01 10:51:15.352

I let Spark infer the schema when reading the file, which yields chr as the data type. I know the inference would work without the milliseconds, yielding a proper timestamp column. However, I need the milliseconds as well and therefore want to change the data type to timestamp within the existing SparkDataFrame.
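For reference, a quick way to confirm what Spark inferred (a minimal sketch; the SparkDataFrame name dataloan matches the snippets below):

    # sketch: check the inferred schema -- Start_1 comes back as string, not timestamp
    printSchema(dataloan)
    head(select(dataloan, "Start_1"))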

Here are the ways I have tested:

as.POSIXct, which would work in base R:

    dataloan_time$start_ts <- as.POSIXct(dataloan$Start_1, format = "%Y/%m/%d %H:%M:%OS")

This doesn't work (it doesn't change the class).

A solution mentioned here and on other sites is casting:

    dataloan_time <- withColumn(dataloan_time, "complete_ts", cast(dataloan$Complete_1, "timestamp"))

For me, it casts the data type correctly but the new column doesn't contain any data.

Here's the result of head() on the new column:

    start_ts
    <lgl>
    NA  
    NA
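One thing I suspect, but haven't verified, is that the cast returns NA simply because Spark's default string-to-timestamp cast expects the yyyy-MM-dd HH:mm:ss[.SSS] form rather than slashes. If that's right, normalising the separators first might be enough (a sketch only, using the same column names as above):

    # untested sketch: swap the slashes for dashes so the default cast can parse it
    dataloan_time <- withColumn(dataloan_time, "start_ts",
                                cast(regexp_replace(dataloan_time$Start_1, "/", "-"), "timestamp"))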

Collecting the data frame and converting it locally would be my last resort, but I'd like to avoid that and do it within the SparkDataFrame. What other solutions are there? Ideally something like the first attempt (as.POSIXct).

Roman

1 Answer


I jumped on this because I thought it might be something simple; I don't think it's that simple anymore :) Environment: Spark 1.5.2

    # setup; probably a better way to do this -- R isn't my first language
    d <- c("2016/01/01 10:51:15.304", "2016/01/01 10:51:15.352")
    df1 <- data.frame(d)
    df <- createDataFrame(sqlContext, df1)

start of my confusion:

    > as.numeric(as.POSIXct(d))
    [1] 1451663475 1451663475

Why does it print the same number? They are different!!

    > print(as.numeric(as.POSIXct(d[1])),digits=20)
    [1] 1451663475.3039999008
    > print(as.numeric(as.POSIXct(d[2])),digits=20)
    [1] 1451663475.3519999981

Okay, so the values actually differ and are just not being printed to my liking (How to parse milliseconds in R? led me down the printing path).
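For the record, base R will print the fractional seconds if you ask it to; a small sketch (the last digit may differ because of floating-point truncation):

    # show 3 digits of fractional seconds when printing POSIXct values
    options(digits.secs = 3)
    as.POSIXct(d)
    # or format a single value explicitly with %OS3
    format(as.POSIXct(d[1]), "%Y-%m-%d %H:%M:%OS3")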

From Writing R data frames returned from SparkR:::map I found a way to write a UDF:

    rows <- SparkR:::flatMap(df, function(x) {
      data <- as.list(x)
      # append the parsed POSIXct (which travels as a double) to the original fields
      datanew <- append(data, as.POSIXct(data[[1]]))
      args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
      do.call(mapply, append(args, datanew))
    })
    sdf <- createDataFrame(sqlContext, rows, c("origdt", "rownum", "newdt"))

If we now do a normal print of the DataFrame, we aren't going to like what we see:

    head(sdf)
                       origdt rownum      newdt
    1 2016/01/01 10:51:15.304      1 1451663475
    2 2016/01/01 10:51:15.352      2 1451663475

    # register it so we can run SQL
    registerTempTable(sdf, "sdf")

However, if we order by newdt we can see that the values are being converted correctly:

    head(sql(sqlContext,"select * from sdf order by newdt"))
                       origdt rownum      newdt
    1 2016/01/01 10:51:15.304      1 1451663475
    2 2016/01/01 10:51:15.352      2 1451663475

    head(sql(sqlContext,"select * from sdf order by newdt desc"))
                       origdt rownum      newdt
    1 2016/01/01 10:51:15.352      2 1451663475
    2 2016/01/01 10:51:15.304      1 1451663475

We can also look at the underlying rdd data to confirm that the conversions happened as we expected:

    print(take(rows,1)[[1]][[3]],digits=20)
    [1] 1451663475.3039999008

Note: I've kept the values as doubles the whole time because I assume the end goal is to do datediff-style arithmetic anyway. If you need a true timestamp representation, I would think you could alter the UDF to store it as such.
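For instance, since the doubles are just seconds since the epoch, a difference in (fractional) seconds can be computed straight in SQL against the temp table registered above; an illustrative query, not something from the original post:

    # difference between the two timestamps, in seconds (sub-second part preserved)
    head(sql(sqlContext, "select max(newdt) - min(newdt) as diff_seconds from sdf"))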

James Tobin
  • First of all, thanks for taking the time for this great solution! Unfortunately, I use Spark 2.1.0, and after executing the command to create "rows" I use createDataFrame as you did (with sqlContext). However, the resulting data frame only has the first two columns; newdt doesn't show up ... – Roman Apr 13 '17 at 08:42
  • Well, that's no fun :( I don't have a readily available Spark 2.1.0 environment, nor is R my first language of choice, so that makes proceeding difficult. Sorry :( – James Tobin Apr 13 '17 at 12:53