-1

How does it work to aggregate a variable in a postgres db backend table to its unique value. For example i have the following table:

library(tidyverse)
library(dbplyr)

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
summarise(aggregatedSum = sum(b), 
          aggregatedUnique = unique(a))

But neither unique() nor distinct() are doing the job. Any ideas how to achieve my desired outcome like so when i collect() the table before summarise:

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
collect() %>%
summarise(aggregatedSum = sum(b), 
          aggregatedUnique = unique(a))

# A tibble: 1 x 2
  aggregatedSum aggregatedUnique
          <dbl>            <dbl>
1             9                2

r2evans
  • 141,215
  • 6
  • 77
  • 149
werN
  • 109
  • 7

2 Answers2

1

You can just add a group_by to your dplyr pipe:

> dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
+   group_by(a) %>% 
+   summarise(aggregatedSum = sum(b)) %>% 
+   rename(aggregatedUnique = a) %>% 
+   select(aggregatedSum, aggregatedUnique)

# Source:   lazy query [?? x 2]
# Database: sqlite 3.34.1 [:memory:]
  aggregatedSum aggregatedUnique
          <dbl>            <dbl>
1             9                2

If there are multiple distinct values in column a, this will return a row per value (with the sum of the b values that occur alongside them).

henryn
  • 1,163
  • 4
  • 15
  • `group_by()` would be a possibility, but I always try to avoid it because it is more computationally intensive. But unfortunately it seems there are no unique/distinct alike functions. I'll will then just use another aggregate function like max()/min()/mean() for such cases, since every value in `a` is equal anyway. – werN Oct 15 '21 at 06:38
1

I might be misinterpreting, but this seems like a grouping operation, where you might want the sum of b for each unique value of a. If so, then group_by(a):

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  group_by(a) %>%
  summarise(aggregatedSum = sum(b))
# Source:   lazy query [?? x 2]
# # Database: sqlite 3.33.0 [:memory:]
#       a aggregatedSum
#   <dbl>         <dbl>
# 1     2             9

This is related to How to combine SELECT DISTINCT and SUM() in that I believe SQL does not let you sum(.) and distinct(.) in the same query. The unchanged query looks like:

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  summarise(aggregatedSum = sum(b), 
          aggregatedUnique = distinct(a)) %>%
  show_query()
# <SQL>
# SELECT SUM(`b`) AS `aggregatedSum`, distinct(`a`) AS `aggregatedUnique`
# FROM `dbplyr_014`

whereas the updated query is

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  group_by(a) %>%
  summarise(aggregatedSum = sum(b)) %>%
  show_query()
# <SQL>
# SELECT `a`, SUM(`b`) AS `aggregatedSum`
# FROM `dbplyr_016`
# GROUP BY `a`

which is aligned with the linked question/answer.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • `group_by()` would be a possibility, but I always try to avoid it because it is more computationally intensive. But unfortunately it seems there are no unique/distinct alike functions. I'll will then just use another aggregate function like max()/min()/mean() for such cases, since every value in `a` is equal anyway. – werN Oct 15 '21 at 06:38
  • The data-viewpoint here is confusing to me. I suspect that it is overly-simplified for the sake of a question on SO, where the variability of the larger real data is unknown to us. Good luck. – r2evans Oct 15 '21 at 09:51