
I am a regular R user.

For a data.frame that looks like the one below I would like to compute basic aggregation statistics: minimum, 1st quartile, median, 3rd quartile and maximum. The following code, using the reshape2 and dplyr packages, performs that operation in plain R:

library(reshape2)
library(dplyr)

tidy_data <- data.frame(topic1 = rnorm(10^6),
                        topic2 = rnorm(10^6),
                        topic3 = rnorm(10^6),
                        topic4 = rnorm(10^6),
                        topic5 = rnorm(10^6))

tidy_data %>%
    melt(measure.vars = c("topic1", "topic2", "topic3", "topic4", "topic5")) %>%
    group_by(variable) %>%
    summarise(MIN = min(value),
              Q1  = quantile(value, 0.25),
              Q2  = median(value),
              Q3  = quantile(value, 0.75),
              MAX = max(value))

I am wondering how such operations can be reproduced on a distributed data frame (Spark's DataFrame object) in SparkR. I've managed to calculate the maximum of each variable, but not in an efficient or elegant way. Is there a way to do this efficiently and cleanly?

My SparkR code is below:

system.time({
    print(
        head(
            summarize(topics5,
                      MAX5 = max(topics5$topic5),
                      MAX4 = max(topics5$topic4),
                      MAX3 = max(topics5$topic3),
                      MAX2 = max(topics5$topic2),
                      MAX1 = max(topics5$topic1))
        )
    )
})
Marcin
  • Generally speaking, computing exact quantiles on large datasets is usually not practical. It is possible to use the internal API to get something similar to [this](http://stackoverflow.com/a/31437177), but once again I doubt it is a good idea. Regarding the rest of your question... If by "efficient and smooth" you mean something similar to typical R metaprogramming magic, I am afraid there is nothing like this (yet?). No `substitute`, `lazy_eval` and similar stuff in the SparkR source. – zero323 Jul 16 '15 at 19:47
  • Spark SQL is much more mature and can be easily accessed from SparkR though. Something like `registerTempTable(topics5, "topics5"); sql(sqlContext, "SELECT max(topic5) max5, max(topic4) max4, ..., FROM topics5")` is probably worth considering. – zero323 Jul 16 '15 at 19:53
  • Did you try out the new `SparkRext` package available on GitHub: https://github.com/hoxo-m/SparkRext? It uses the same syntax as `dplyr` for manipulating Spark `DataFrames`. – KRC Jul 17 '15 at 01:58
  • You could also use the `describe` method in SparkR to generate statistics on numeric columns. For example `df <- data.frame(a=rnorm(10), b=rnorm(10)); sdf <- createDataFrame(sqlContext, df); collect(describe(sdf))`. This should print count, mean, max, min, etc. – Shivaram Venkataraman Jul 17 '15 at 17:15
  • Thank You very much shivaram :)! – Marcin Jul 18 '15 at 22:50
  • You can post it as an answer @ShivaramVenkataraman – Marcin Jul 18 '15 at 22:50
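
A minimal sketch of the Spark SQL route suggested in the comments above, assuming `topics5` is the SparkR DataFrame from the question and `sqlContext` is the existing SQL context:

# Register the DataFrame so it can be queried with SQL
registerTempTable(topics5, "topics5")

# Compute the maximum of every column in a single query
maxima <- sql(sqlContext, "SELECT max(topic1) AS max1,
                                  max(topic2) AS max2,
                                  max(topic3) AS max3,
                                  max(topic4) AS max4,
                                  max(topic5) AS max5
                           FROM topics5")
head(maxima)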

1 Answer


You can use the `describe` method in SparkR to generate statistics on numeric columns. For example:

df <- data.frame(a=rnorm(10), b=rnorm(10))
sdf <- createDataFrame(sqlContext, df)
collect(describe(sdf))

This should print count, mean, max, min, etc. for each numeric column.
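
Note that `describe` does not return quartiles. One way to approximate them, assuming the queries run through a HiveContext so that Hive's `percentile_approx` UDAF is available (as noted in the comments, exact quantiles are rarely practical at this scale), is a Spark SQL query along these lines:

# Approximate quartiles for one column using the percentile_approx UDAF.
# Assumes topics5 was registered as a temp table as in the sketch above.
stats1 <- sql(sqlContext, "SELECT min(topic1)                     AS MIN,
                                  percentile_approx(topic1, 0.25) AS Q1,
                                  percentile_approx(topic1, 0.5)  AS Q2,
                                  percentile_approx(topic1, 0.75) AS Q3,
                                  max(topic1)                     AS MAX
                           FROM topics5")
collect(stats1)

The same pattern can be repeated (or built up with paste()) for topic2 through topic5.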