I am a regular R user. For a data.frame like the one below, I would like to compute basic aggregation statistics: minimum, 1st quartile, median, 3rd quartile and maximum. In plain R, the following code using the reshape2 and dplyr packages does that:
library(reshape2)
library(dplyr)
tidy_data <- data.frame(topic1 = rnorm(10^6),
                        topic2 = rnorm(10^6),
                        topic3 = rnorm(10^6),
                        topic4 = rnorm(10^6),
                        topic5 = rnorm(10^6))

tidy_data %>%
  melt(measure.vars = c("topic1", "topic2", "topic3", "topic4", "topic5")) %>%
  group_by(variable) %>%
  summarise(MIN = min(value),
            Q1  = quantile(value, 0.25),
            Q2  = median(value),
            Q3  = quantile(value, 0.75),
            MAX = max(value))
I am wondering how such operations can be reproduced on a distributed data frame (Spark's DataFrame object) in SparkR. I have managed to calculate the maximum of each variable, but only in a clumsy, inelegant way. Is there an efficient and smooth way to do it? My SparkR code is below:
system.time({
  # topics5 is the Spark DataFrame holding the five topic columns
  print(
    head(
      summarize(topics5,
                MAX5 = max(topics5$topic5),
                MAX4 = max(topics5$topic4),
                MAX3 = max(topics5$topic3),
                MAX2 = max(topics5$topic2),
                MAX1 = max(topics5$topic1)
      )
    )
  )
})
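For reference, here is a minimal sketch (untested, and assuming topics5 is the SparkR DataFrame used above) of how the repeated max() calls could be generated programmatically instead of being written out once per column:

library(SparkR)

# build one aggregation expression per column instead of typing them by hand
cols <- columns(topics5)
agg_exprs <- lapply(cols, function(x) max(topics5[[x]]))
names(agg_exprs) <- paste0("MAX_", cols)

# do.call() passes the DataFrame plus all named expressions to summarize()
head(do.call(summarize, c(list(topics5), agg_exprs)))

This only removes the repetition in the max() example; for the quartiles themselves, depending on the Spark version, something like approxQuantile() or an SQL percentile_approx expression would presumably be needed.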