
Hi, I have the following Spark data frame and want to sum across rows. I couldn't find an equivalent of `rowSums` in sparklyr. I tried the following query, but it sums the column values instead:

trans1M_20 %>% 
        dplyr::select(trans1M_20_clicks) %>%
        dplyr::select(Small_SubChannel_Clicks_names) %>%
        dplyr::summarise_each(funs(sum))

Note that `Alma_1` is the column named in `Small_SubChannel_Clicks_names`.

Alma_1
<dbl>
0
0
0
0
0
0
0
0
0
0

I am trying to aggregate multiple columns into one single column by summing across rows. Small_SubChannel_Clicks_names can contain any number of column names. I want the sparklyr equivalent of:

rowSums(trans1M_20[, Small_SubChannel_Clicks_names, drop = FALSE])
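For context, this is what `rowSums` does locally in base R; the data frame below is a made-up stand-in for the Spark table (`trans_local` and its values are hypothetical):

```r
# Hypothetical local stand-in for the Spark table
trans_local <- data.frame(Alma_1 = c(1, 0, 2), Alma_2 = c(3, 1, 0))
Small_SubChannel_Clicks_names <- c("Alma_1", "Alma_2")

# rowSums collapses the selected columns into one per-row total
unname(rowSums(trans_local[, Small_SubChannel_Clicks_names, drop = FALSE]))
# [1] 4 1 2
```

The question is how to get this per-row behavior on a Spark tbl, where `rowSums` is not available.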

2 Answers


As long as the number of columns is low, you can just generate and evaluate the expression:

library(sparklyr)
library(dplyr)
library(rlang)

# `sc` is an existing Spark connection, e.g. sc <- spark_connect(master = "local")
df <- copy_to(sc, data.frame(x = c(1, 3), y = c(2, 4)))

columns_to_sum <- colnames(df)

# Build the string "x + y", parse it, and splice it into transmute
df %>% transmute(total = !!parse_expr(paste(columns_to_sum, collapse = " + ")))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  total
  <dbl>
1     3
2     7

Unfortunately, this won't work with a larger number of columns. For workarounds, see Efficiently calculate row totals of a wide Spark DF.
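A variant of the same idea avoids string pasting by folding column symbols into a single expression. This is a sketch using `rlang` and `purrr`, shown on a local data frame for brevity; the same `transmute` call works on a sparklyr tbl:

```r
library(dplyr)
library(rlang)
library(purrr)

# Local data frame; with sparklyr, df would be a Spark tbl from copy_to()
df <- data.frame(x = c(1, 3), y = c(2, 4))
columns_to_sum <- colnames(df)

# Fold the column symbols into one `x + y + ...` call instead of pasting strings
sum_expr <- reduce(syms(columns_to_sum), ~ expr(!!.x + !!.y))

df %>% transmute(total = !!sum_expr)
#   total
# 1     3
# 2     7
```

Building the expression from symbols rather than pasted strings sidesteps quoting issues with non-syntactic column names, though it hits the same limits as the pasted version once the expression gets very large.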

zero323

There is a way to do what you want using the DBI package, which lets you send SQL queries to your Spark table. It's not a dplyr solution, but it may be useful until someone comes up with a more "dplyr-ish" one.

You have to include your column names in the query manually. For instance, if you wanted to sum the columns "trans1M_20_clicks" and "Small_SubChannel_Clicks_names", you could do something like this:

resSparkSQL <- DBI::dbSendQuery(sc, "SELECT trans1M_20_clicks + Small_SubChannel_Clicks_names as Total FROM your_table_name_in_spark")

The result is a DBISparkResult object. To pull the result column into R as a data.frame instead, use dbGetQuery:

resSparkSQL <- DBI::dbGetQuery(sc, "SELECT trans1M_20_clicks + Small_SubChannel_Clicks_names as Total FROM your_table_name_in_spark")
Jaime Caffarel
  • What do you mean by "Small_SubChannel_Clicks_names can contain any number of columns"? When you use `dplyr::select(Small_SubChannel_Clicks_names)`, you are selecting just that column. – Jaime Caffarel Nov 22 '17 at 16:13
  • Small_SubChannel_Clicks_names contains the names of multiple columns, so when I use it in a select statement I get a subset of those columns. Now I want to merge them into one single column. – Mustufain Nov 22 '17 at 16:19
  • No need for the manual step: just `paste("SELECT", paste(cols, collapse=" + "), "as Total FROM your_table_name_in_spark")` or `glue::glue("SELECT {paste(cols, collapse=' + ')} as Total FROM your_table_name_in_spark")`, where `cols` is a vector of names. – zero323 Dec 20 '17 at 17:51