Background/overview:
I am attempting to apply the gsub function to a column of a SparkR DataFrame that I have loaded into Spark as follows:
dat <- read.df(sqlContext, "filepath", header='false', inferSchema='true')
I am using Spark 1.6.1 and the data file was stored as a parquet file before reading it in as a SparkR DataFrame.
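Incidentally, since the file on disk is parquet, I suspect the header and inferSchema arguments are csv-reader options that get ignored here; a sketch of the read with the source named explicitly (untested, assuming Spark 1.6's read.df signature):

# Sketch, assuming Spark 1.6.x: name the source explicitly; header and
# inferSchema are csv options and should be unnecessary for parquet.
dat <- read.df(sqlContext, "filepath", source = "parquet")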
The core of the problem:
I have a column called period in my DataFrame (DF) that consists of dates currently stored as strings in the form MM/DD/YYYY, e.g. 09/23/2001. I would like to convert this into a date type object in SparkR. From what I can tell, however, the cast and as.Date functions in SparkR can only convert a string date into a date type object if it is in the format MM-DD-YYYY.
To get my period column into a form that can be recast to a date dtype, I'm trying to use the base R gsub function together with the SparkR withColumn function to create a new DF, dat2, with an appended column, nperiod, in which every entry of period has been converted from MM/DD/YYYY to MM-DD-YYYY. My first attempt is the code below, which produced the error message that follows:
dat2 <- withColumn(dat, "nperiod", gsub("/", "-", dat$period))

Error in withColumn(dat, "nperiod", gsub("/", "-", dat$period)) : 
  error in evaluating the argument 'col' in selecting a method for function 'withColumn': 
  Error in as.character.default(x) : 
  no method for coercing this S4 class to a vector
Perhaps this is simply my ignorance of how core Spark uses S4 classes in SparkR, but I'm not sure how to interpret this error message or how to proceed with troubleshooting the gsub approach to this problem.
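My current suspicion is that gsub fails because it is a base R function that expects an ordinary character vector, while dat$period is an S4 Column object that only SparkR's own column functions know how to operate on. If that's right, regexp_replace may be the column-wise analogue of gsub; here is an untested sketch of what I have in mind (assuming the regexp_replace signature in SparkR 1.6):

# Sketch, assuming SparkR 1.6.x: regexp_replace() works on a Spark Column,
# unlike base R's gsub(), which needs a plain character vector.
dat2 <- withColumn(dat, "nperiod", regexp_replace(dat$period, "/", "-"))
head(select(dat2, "nperiod"))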
Alternatively, and a much hackier approach, I could split the MM/DD/YYYY period column into three separate columns. Even this, however, I am struggling with in the SparkR environment. I've gotten as far as creating a new DF, called separated, that consists of a single column (period_sep) whose rows hold the period components separated by commas, though I'm not entirely sure what data structure this is, or what the next step would be to get it into three separate columns (see the sketch after the output below).
> separated <- selectExpr(dat, "split(period, '/') AS period_sep")
> head(separated)
period_sep
1 01, 01, 2000
2 02, 01, 2000
3 03, 01, 2000
4 04, 01, 2000
5 05, 01, 2000
6 06, 01, 2000
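If this route is workable, I imagine the array produced by split could be indexed inside the SQL expression itself to pull each component into its own column. An untested sketch of what I mean (I'm assuming Spark SQL's bracket indexing works on the result of split):

# Sketch, assuming Spark SQL array indexing on the split result:
# pull month, day, and year out into separate string columns.
separated3 <- selectExpr(dat,
                         "split(period, '/')[0] AS month",
                         "split(period, '/')[1] AS day",
                         "split(period, '/')[2] AS year")
head(separated3)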
If anyone has thoughts on how to proceed in either of these directions, or if there is a much better way to do this, it would be very much appreciated. Additionally, if it seems as though I'm misunderstanding some underlying Spark concept that would help explain what's going on, please feel free to share any information about that.
Edit: adding information about the error received when I attempt to use cast:
When I attempt to cast period to the date dtype using withColumn, I get the following error message:
dat2 <- withColumn(dat, "nperiod", cast(dat$period, "date"))

Error in withColumn(dat, "nperiod", cast(dat$period, "date")) : 
  error in evaluating the argument 'col' in selecting a method for function 'withColumn': 
  Error in cast(dat$period, "date") : 
  error in evaluating the argument 'x' in selecting a method for function 'cast': 
  Error in column(callJMethod(x@sdf, "col", c)) : 
  error in evaluating the argument 'x' in selecting a method for function 'column': 
  Error in callJMethod(x@sdf, "col", c) : 
  Invalid jobj 2. If SparkR was restarted, Spark operations need to be re-executed.
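The last line suggests my dat reference went stale after a SparkR restart, so the cast itself may never have actually run. My plan is to re-read the data and then try parsing the MM/dd/yyyy strings directly. An untested sketch, assuming SparkR 1.6's unix_timestamp and from_unixtime functions accept a Java date-format string:

# Sketch, assuming SparkR 1.6.x: re-create the (possibly stale) DataFrame,
# then parse the MM/dd/yyyy strings via a Unix timestamp and cast to date.
dat <- read.df(sqlContext, "filepath", source = "parquet")
dat2 <- withColumn(dat, "nperiod",
                   cast(from_unixtime(unix_timestamp(dat$period, "MM/dd/yyyy")), "date"))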