1

I try to implement SQL query using sqldf and dplyr.
I need to do this separately using these 2 different libraries.
Unfortunately, I cannot produce the same result using dplyr.

library(sqldf)
library(dplyr)

Id       <- c(1,2,3,4)
HasPet   <- c(0,0,1,1)
Age      <- c(20,1,14,10)

Posts <- data.frame(Id, HasPet, Age)

# sqldf way
ref <- sqldf("
      SELECT Id, HasPet, MAX(Age) AS MaxAge
      FROM Posts
      GROUP BY HasPet
  ")

# dplyr way
res <- Posts %>%
  group_by(HasPet) %>%
  summarize(
    Id,
    HasPet,
    MaxAge = max(Age)
    ) %>%
  select(Id, HasPet, MaxAge)

head(ref)
head(res)

Output for sqldf is:

> head(ref)
  Id HasPet MaxAge
1  1      0     20
2  3      1     14

while the output for sqldf is different:

> head(res)
# A tibble: 4 x 3
# Groups:   HasPet [2]
     Id HasPet MaxAge
  <dbl>  <dbl>  <dbl>
1     1      0     20
2     2      0     20
3     3      1     14
4     4      1     14

UPD. SQL query cannot be modified.

lol lol
  • 319
  • 3
  • 18
  • What is the exact output you want from `dplyr` and why? – Tim Biegeleisen Apr 23 '21 at 09:46
  • @Tim Biegeleisen This has been requested on some R-courses. I need to get the same result as sqldf produces but using dplyr. This is a prerequisite for passing the test. – lol lol Apr 23 '21 at 09:54

3 Answers3

1

The code is not wrong, but the logic you are trying to achieve is it. Let me explain:

Your expected output for your grouping contains Id=1,3. But how R will know are those and not Id=2,4?. More specific, when you are grouping by HasPet=0, which value for Id will R choose? 1 or 2? How R will know it if you didn't give it specific criteria to use? That said, this gives your expected output:

res <- Posts %>%
  group_by(HasPet) %>%
  summarize(Id = min(Id),
            MaxAge = max(Age))
Chris
  • 2,019
  • 5
  • 22
  • 67
0

The answer to your question is that the SQL query is not doing the same thing as your R code version. Here is the equivalent SQL query:

SELECT Id, HasPet, MAX(Age) OVER (PARTITION BY HasPet) AS MaxAge
FROM Posts

Acutally, your current query is technically invalid, because it aggregates by HasPet, but selects the Id. It isn't clear which value of Id you want to select. Here is a valid version of your original query:

SELECT HasPet, MAX(Age) AS MaxAge
FROM Posts
GROUP BY HasPet
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • Thank you for your answer! Actually, my question was about the way to get the same behaviour as sqldf has, but using dplyr. Instead, you've adjusted sqldf's query request. I'm not aware if original SQL query is invalid. I think that it is not, because it executed without warnings or errors. – lol lol Apr 23 '21 at 09:35
  • The SQL output you are trying to match is being generated by an invalid query. Please explain the logic behind how the `Id` values are being selected. – Tim Biegeleisen Apr 23 '21 at 09:37
  • Are you sure this request is invalid? Are there any limitations in the SQL documentation related to this particular query? As I understand final result from sqldf produces result with filtered duplicated values listed in HasPet column. I also found that same behavior as sqldf has can be achieved using "aggregate" (it is not dplyr function). See https://stackoverflow.com/questions/6289538/aggregate-a-dataframe-on-a-given-column-and-display-another-column – lol lol Apr 23 '21 at 12:01
  • My previous comment had wrong URL. It was fixed now. Sorry. – lol lol Apr 23 '21 at 12:03
  • I agree with you that original SQL request looks weird. Anyway, it works. I'm unsure if sql query is technically invalid. – lol lol Apr 23 '21 at 14:51
  • The sqldf query in the question using Id is valid. sqlite supports an extension to sql in which if you use max or min the other columns are guaranteed to be taken from the same row. – G. Grothendieck Apr 23 '21 at 19:45
  • @G.Grothendieck Not valid according to ANSI SQL, nor should you be promoting such syntax. And I still don't know the logic by which we end up with thoss `id` values. – Tim Biegeleisen Apr 23 '21 at 23:42
  • It is documented behavior. See the last paragraph of the Generation of Result Rows section of this page: https://sqlite.org/lang_select.html – G. Grothendieck Apr 24 '21 at 12:08
  • @G.Grothendieck That it is documented doesn't mean much, because the feature is non ANSI compliant and bad practice. MySQL also sometimes allows it, depending on the server mode. [See this canonical SO question](https://stackoverflow.com/questions/41887460/select-list-is-not-in-group-by-clause-and-contains-nonaggregated-column-inc) to better understand why the OP's SQL may be dangerous. – Tim Biegeleisen Apr 24 '21 at 13:52
  • If you are trying to write for multiple databases at the same time such as using SQLite in advance of moving to MySQL it would make sense to use the intersection of their features but that is usually not the case. Suggest you try writing the same code without using that feature and notice how much more code is needlessly involved. – G. Grothendieck Apr 24 '21 at 14:01
0

This problem can be solved by using:

slice(which.min(Id))

after "group_by" and "summarize" function calls.

For example:

# dplyr way
res <- Posts %>%
  group_by(HasPet) %>%
  summarize(
    Id,
    HasPet,
    MaxAge = max(Age)
    ) %>%
  select(Id, HasPet, MaxAge) %>%
  slice(which.min(Id))

In this case, output is the same as if to use dplyr:

> res
# A tibble: 2 x 3
# Groups:   HasPet [2]
     Id HasPet MaxAge
  <dbl>  <dbl>  <dbl>
1     1      0     20
2     3      1     14

P.S. I think there are simpler ways, but so far I have not found them.

lol lol
  • 319
  • 3
  • 18