Background/overview:
I am attempting to apply the gsub function to a column of a SparkR DataFrame that I have loaded into Spark as follows:
dat <- read.df(sqlContext, "filepath", header='false', inferSchema='true')
I am using Spark 1.6.1 and the data file was stored as a parquet file before reading it in as a SparkR DataFrame.
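Incidentally, since the file on disk is parquet, I suspect the header and inferSchema arguments are csv-reader options that get ignored here; a sketch of the read with the source named explicitly (untested, assuming Spark 1.6's read.df signature):

# Sketch, assuming Spark 1.6.x: name the source explicitly; header and
# inferSchema are csv options and should be unnecessary for parquet.
dat <- read.df(sqlContext, "filepath", source = "parquet")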
The core of the problem:
I have a column called period in my DataFrame (DF) that consists of dates currently stored as strings in the form MM/DD/YYYY, e.g. 09/23/2001. I would like to convert this into a date type object in SparkR. From what I can tell, however, the cast and as.Date functions in SparkR can only convert a string date into a date type object if it is in the format MM-DD-YYYY.
To get my period column into a form that can be recast to a date dtype, I'm trying to use the base R gsub function together with the SparkR withColumn function to create a new DF, dat2, with an appended column, nperiod, in which every entry of period has been converted from MM/DD/YYYY to MM-DD-YYYY. My first attempt is the code below, which produced the error message that follows:
dat2 <- withColumn(dat, "nperiod", gsub("/", "-", dat$period))

Error in withColumn(dat, "nperiod", gsub("/", "-", dat$period)) : 
  error in evaluating the argument 'col' in selecting a method for function 'withColumn': 
  Error in as.character.default(x) : 
  no method for coercing this S4 class to a vector
Perhaps this is simply my ignorance of how core Spark uses S4 classes in SparkR, but I'm not sure how to interpret this error message or how to proceed with troubleshooting the gsub approach to this problem.
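My current suspicion is that gsub fails because it is a base R function that expects an ordinary character vector, while dat$period is an S4 Column object that only SparkR's own column functions know how to operate on. If that's right, regexp_replace may be the column-wise analogue of gsub; here is an untested sketch of what I have in mind (assuming the regexp_replace signature in SparkR 1.6):

# Sketch, assuming SparkR 1.6.x: regexp_replace() works on a Spark Column,
# unlike base R's gsub(), which needs a plain character vector.
dat2 <- withColumn(dat, "nperiod", regexp_replace(dat$period, "/", "-"))
head(select(dat2, "nperiod"))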
Alternatively, and a much hackier approach, I could split the MM/DD/YYYY period column into three separate columns. Even this, however, I am struggling with in the SparkR environment. I've gotten as far as creating a new DF, called separated, that consists of a single column (period_sep) whose rows hold the period components separated by commas, though I'm not entirely sure what data structure this is, or what the next step would be to get it into three separate columns (see the sketch after the output below).
> separated <- selectExpr(dat, "split(period, '/') AS period_sep")
> head(separated)
period_sep
1 01, 01, 2000
2 02, 01, 2000
3 03, 01, 2000
4 04, 01, 2000
5 05, 01, 2000
6 06, 01, 2000
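If this route is workable, I imagine the array produced by split could be indexed inside the SQL expression itself to pull each component into its own column. An untested sketch of what I mean (I'm assuming Spark SQL's bracket indexing works on the result of split):

# Sketch, assuming Spark SQL array indexing on the split result:
# pull month, day, and year out into separate string columns.
separated3 <- selectExpr(dat,
                         "split(period, '/')[0] AS month",
                         "split(period, '/')[1] AS day",
                         "split(period, '/')[2] AS year")
head(separated3)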
If anyone has thoughts on how to proceed in either of these directions, or if there is a much better way to do this, it would be very much appreciated. Additionally, if it seems as though I'm misunderstanding some underlying Spark concept that would help explain what's going on, please feel free to share any information about that.
Edit: adding information about the error received when I attempt to use cast:
When I attempt to cast period to the date dtype using withColumn, I get the following error message:
dat2 <- withColumn(dat, "nperiod", cast(dat$period, "date"))

Error in withColumn(dat, "nperiod", cast(dat$period, "date")) : 
  error in evaluating the argument 'col' in selecting a method for function 'withColumn': 
  Error in cast(dat$period, "date") : 
  error in evaluating the argument 'x' in selecting a method for function 'cast': 
  Error in column(callJMethod(x@sdf, "col", c)) : 
  error in evaluating the argument 'x' in selecting a method for function 'column': 
  Error in callJMethod(x@sdf, "col", c) : 
  Invalid jobj 2. If SparkR was restarted, Spark operations need to be re-executed.
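The last line suggests my dat reference went stale after a SparkR restart, so the cast itself may never have actually run. My plan is to re-read the data and then try parsing the MM/dd/yyyy strings directly. An untested sketch, assuming SparkR 1.6's unix_timestamp and from_unixtime functions accept a Java date-format string:

# Sketch, assuming SparkR 1.6.x: re-create the (possibly stale) DataFrame,
# then parse the MM/dd/yyyy strings via a Unix timestamp and cast to date.
dat <- read.df(sqlContext, "filepath", source = "parquet")
dat2 <- withColumn(dat, "nperiod",
                   cast(from_unixtime(unix_timestamp(dat$period, "MM/dd/yyyy")), "date"))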