1

How do I perform the GROUP BY ... HAVING query using dbplyr in dplyr?

I have a list of IDs and I have to group by IDs which are not in this list. Is there a way I can directly execute the query with tbl(), if not what is the dplyr verb for the same?

Using group_by_if function from dplyr doesn't seem to do it.

I want to execute something like

SELECT * FROM TBL 
WHERE YEAR(DATE) = 2001
GROUP BY COL1 HAVING COL2 NOT IN ID_LIST

where ID_LIST is an R vector

Surya
  • 171
  • 1
  • 9

1 Answers1

1

For the example you have given, it is not clear to me how

SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
GROUP BY COL1
HAVING COL2 NOT IN ID_LIST

Is different from

SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
AND COL2 NOT IN ID_LIST
GROUP BY COL1

Hence @Rohit's suggestion of applying a filter is an effective solution.

HAVING largely operates the same way as WHERE but after aggregation with the added feature that you can use aggregators in the HAVING clause. See this discussion. But in this case you are not applying aggregators in the HAVING clause so you should be free to use a WHERE clause instead.

Regarding the nested SQL queries that dbplyr produces. It might seem counter intuitive given the usual emphasis on clean, human-readable code, but for dbplyr auto-generated queries I recommend not worrying about the quality of the machine generated code. It is written by a machine and it is (mostly) read by a machine, so its human readability is less important.

Efficiency could be a concern with many layers of nesting. However, in 2017-06-09 dbplyr was given a basic SQL optimiser. I have not found (though I have not tested extensively) nested auto-generated queries to perform significantly worse than non-nested user written queries. But if performance is critical, you probably want to build your SQL query manually by paste-ing together text strings in R.

One final thought - the length of ID_LIST is also important to consider. It is discussed in this question.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41