Under the previous RDD paradigm, I could specify a key and then map an operation to the RDD elements corresponding to each key. I don't see a clear way to do this with a DataFrame in SparkR as of 1.5.1. What I would like to do is something like a dplyr operation:

new.df <- old.df %>%
  group_by(column1) %>%
  do(myfunc(.))
I currently have a large SparkR DataFrame of the form:
timestamp value id
2015-09-01 05:00:00.0 1.132 24
2015-09-01 05:10:00.0 null 24
2015-09-01 05:20:00.0 1.129 24
2015-09-01 05:00:00.0 1.131 47
2015-09-01 05:10:00.0 1.132 47
2015-09-01 05:10:00.0 null 47
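For reference, a small local stand-in for this data can be built roughly as follows (keeping value as character with literal "null" entries is my assumption about how the real data is read in):

local.df <- data.frame(
  timestamp = c("2015-09-01 05:00:00.0", "2015-09-01 05:10:00.0",
                "2015-09-01 05:20:00.0", "2015-09-01 05:00:00.0",
                "2015-09-01 05:10:00.0", "2015-09-01 05:10:00.0"),
  value = c("1.132", "null", "1.129", "1.131", "1.132", "null"),
  id = c(24L, 24L, 24L, 47L, 47L, 47L),
  stringsAsFactors = FALSE
)
df <- createDataFrame(sqlContext, local.df)  # stand-in for the real df read in below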
I've sorted by id and timestamp.

I want to group by id, but I don't want to aggregate. Instead I want to do a set of transformations and computations on each group -- for example, interpolating to fill the NAs (which are generated when I collect the DataFrame and then convert value to numeric). I've tested using agg, but while my computations do appear to run, the results aren't returned, because I'm not returning a single value from myfunc:
library(zoo)

myfunc <- function(df) {
  df.loc <- collect(df)
  df.loc$value <- as.numeric(df.loc$value)
  df.loc$newparam <- na.approx(df.loc$value, na.rm = FALSE)
  return(df.loc)
  # I also tested return(createDataFrame(sqlContext, df.loc)) here
}
df <- read.df(...)  # some stuff
grp <- group_by(df, "id")
test <- agg(grp, "myfunc")
15/11/11 18:45:33 INFO scheduler.DAGScheduler: Job 2 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 0.463131 s
id
1 24
2 47
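As far as I can tell, agg expects column expressions that reduce each group to a single summary value, along the lines of the following (assuming value were already numeric), which is not what I want:

summ <- agg(grp, mean_value = avg(df$value))  # one row per id, one number per group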
Note that the operations in myfunc all work correctly when I filter the DataFrame down to a single id and run it. Based on the time it takes to run (about 50 seconds per task) and the fact that no exceptions are thrown, I believe myfunc is indeed being run on all of the ids -- but I need the output!
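For what it's worth, a brute-force sketch of that per-id filtering (collecting each group to the driver, so it presumably won't scale to many ids) would look roughly like this:

library(zoo)

# Run myfunc's logic once per id by filtering and collecting each group
# locally, then stacking the local results back together.
ids <- collect(distinct(select(df, "id")))$id

per.id <- lapply(ids, function(i) {
  grp.loc <- collect(filter(df, df$id == i))
  grp.loc$value <- as.numeric(grp.loc$value)
  grp.loc$newparam <- na.approx(grp.loc$value, na.rm = FALSE)
  grp.loc
})

result <- do.call(rbind, per.id)  # plain local data.frame
# result.sdf <- createDataFrame(sqlContext, result)  # back to a SparkR DataFrame if needed

But that runs everything through the driver, which defeats the point of Spark -- what I'm really after is a grouped DataFrame operation.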
Any input would be appreciated.