For the example you have given, it is not clear to me how
SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
GROUP BY COL1
HAVING COL2 NOT IN ID_LIST
Is different from
SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
AND COL2 NOT IN ID_LIST
GROUP BY COL1
Hence @Rohit's suggestion of applying a filter is an effective solution.
HAVING
largely operates the same way as WHERE
but after aggregation with the added feature that you can use aggregators in the HAVING
clause. See this discussion. But in this case you are not applying aggregators in the HAVING
clause so you should be free to use a WHERE
clause instead.
Regarding the nested SQL queries that dbplyr produces. It might seem counter intuitive given the usual emphasis on clean, human-readable code, but for dbplyr auto-generated queries I recommend not worrying about the quality of the machine generated code. It is written by a machine and it is (mostly) read by a machine, so its human readability is less important.
Efficiency could be a concern with many layers of nesting. However, in 2017-06-09 dbplyr was given a basic SQL optimiser. I have not found (though I have not tested extensively) nested auto-generated queries to perform significantly worse than non-nested user written queries. But if performance is critical, you probably want to build your SQL query manually by paste
-ing together text strings in R.
One final thought - the length of ID_LIST
is also important to consider. It is discussed in this question.