Similliar aggregate function to unique/distinct in R Postgres Backend

Question

How does it work to aggregate a variable in a postgres db backend table to its unique value. For example i have the following table:

library(tidyverse)
library(dbplyr)

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
summarise(aggregatedSum = sum(b), 
          aggregatedUnique = unique(a))

But neither unique() nor distinct() are doing the job. Any ideas how to achieve my desired outcome like so when i collect() the table before summarise:

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
collect() %>%
summarise(aggregatedSum = sum(b), 
          aggregatedUnique = unique(a))

# A tibble: 1 x 2
  aggregatedSum aggregatedUnique
          <dbl>            <dbl>
1             9                2

Does column `a` in your database always only contain one distinct value? Or do you have several distinct values in column `a` and you want a row for each one in your output? — henryn, Oct 14 '21 at 14:55

score 1 · Answer 1 · answered Oct 14 '21 at 15:16

You can just add a group_by to your dplyr pipe:

> dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
+   group_by(a) %>% 
+   summarise(aggregatedSum = sum(b)) %>% 
+   rename(aggregatedUnique = a) %>% 
+   select(aggregatedSum, aggregatedUnique)

# Source:   lazy query [?? x 2]
# Database: sqlite 3.34.1 [:memory:]
  aggregatedSum aggregatedUnique
          <dbl>            <dbl>
1             9                2

If there are multiple distinct values in column a, this will return a row per value (with the sum of the b values that occur alongside them).

`group_by()` would be a possibility, but I always try to avoid it because it is more computationally intensive. But unfortunately it seems there are no unique/distinct alike functions. I'll will then just use another aggregate function like max()/min()/mean() for such cases, since every value in `a` is equal anyway. — werN, Oct 15 '21 at 06:38

r2evans · Answer 2 · 2021-10-14T15:22:50.873

I might be misinterpreting, but this seems like a grouping operation, where you might want the sum of b for each unique value of a. If so, then group_by(a):

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  group_by(a) %>%
  summarise(aggregatedSum = sum(b))
# Source:   lazy query [?? x 2]
# # Database: sqlite 3.33.0 [:memory:]
#       a aggregatedSum
#   <dbl>         <dbl>
# 1     2             9

This is related to How to combine SELECT DISTINCT and SUM() in that I believe SQL does not let you sum(.) and distinct(.) in the same query. The unchanged query looks like:

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  summarise(aggregatedSum = sum(b), 
          aggregatedUnique = distinct(a)) %>%
  show_query()
# <SQL>
# SELECT SUM(`b`) AS `aggregatedSum`, distinct(`a`) AS `aggregatedUnique`
# FROM `dbplyr_014`

whereas the updated query is

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  group_by(a) %>%
  summarise(aggregatedSum = sum(b)) %>%
  show_query()
# <SQL>
# SELECT `a`, SUM(`b`) AS `aggregatedSum`
# FROM `dbplyr_016`
# GROUP BY `a`

which is aligned with the linked question/answer.

`group_by()` would be a possibility, but I always try to avoid it because it is more computationally intensive. But unfortunately it seems there are no unique/distinct alike functions. I'll will then just use another aggregate function like max()/min()/mean() for such cases, since every value in `a` is equal anyway. — werN, Oct 15 '21 at 06:38
The data-viewpoint here is confusing to me. I suspect that it is overly-simplified for the sake of a question on SO, where the variability of the larger real data is unknown to us. Good luck. — r2evans, Oct 15 '21 at 09:51

Similliar aggregate function to unique/distinct in R Postgres Backend

2 Answers2