2

How to add index by category in R with sorting by column in sqldf package. I look for equivalent of SQL:

ROW_NUMBER() over(partition by [Category] order by [Date] desc

Suppose we have a table:

+----------+-------+------------+
| Category | Value |    Date    |
+----------+-------+------------+
| apples   |     3 | 2018-07-01 |
| apples   |     2 | 2018-07-02 |
| apples   |     1 | 2018-07-03 |
| bananas  |     9 | 2018-07-01 |
| bananas  |     8 | 2018-07-02 |
| bananas  |     7 | 2018-07-03 |
+----------+-------+------------+

Desired results are:

+----------+-------+------------+-------------------+
| Category | Value |    Date    | Index by category |
+----------+-------+------------+-------------------+
| apples   |     3 | 2018-07-01 |                 3 |
| apples   |     2 | 2018-07-02 |                 2 |
| apples   |     1 | 2018-07-03 |                 1 |
| bananas  |     9 | 2018-07-01 |                 3 |
| bananas  |     8 | 2018-07-02 |                 2 |
| bananas  |     7 | 2018-07-03 |                 1 |
+----------+-------+------------+-------------------+

Thank you for hints in comments how it can be done in lots of other packages different then sqldf: Numbering rows within groups in a data frame

Przemyslaw Remin
  • 6,276
  • 25
  • 113
  • 191
  • Possible duplicate of https://stackoverflow.com/questions/12925063/numbering-rows-within-groups-in-a-data-frame , let us know if this link helps, then we can close this post as duplicate. – zx8754 Jul 23 '18 at 09:28
  • @zx8754 Thank you for this valuable link. I have reworded my question as I have not found in the link how to do it in sqldf package. – Przemyslaw Remin Jul 23 '18 at 16:31
  • No, sqldf doesn't have `over(partition` syntax. It could be done using sqldf, I will add the link when I find the solution, I remember seeing similar post asking for sqldf solution. – zx8754 Jul 23 '18 at 17:08
  • Related post: https://stackoverflow.com/q/32364351/680068 – zx8754 Jul 23 '18 at 19:17
  • Also from same Q&A see comment by [@G.Grothendieck (sqldf author)](https://stackoverflow.com/questions/32364351/unable-to-use-rank-over-functions-in-r-using-sqldf#comment52630228_32365046) – zx8754 Jul 23 '18 at 19:19
  • 1
    @zx8754. sqldf is just the front end. As in my answer below it does allow use of over partition if you use the PostgreSQL backend with it. – G. Grothendieck Jul 24 '18 at 13:09
  • @G.Grothendieck Correct, sorry I meant to say SQLite. Thank you for the answer. – zx8754 Jul 24 '18 at 13:49

1 Answers1

2

1) PostgreSQL This can be done with the PostgreSQL backend to sqldf:

library(RPostgreSQL)
library(sqldf)

sqldf('select *, 
       ROW_NUMBER() over (partition by "Category" order by "Date" desc) as seq
       from "DF"
       order by "Category", "Date" ')

giving:

  Category Value       Date seq
1   apples     3 2018-07-01   3
2   apples     2 2018-07-02   2
3   apples     1 2018-07-03   1
4  bananas     9 2018-07-01   3
5  bananas     8 2018-07-02   2
6  bananas     7 2018-07-03   1

2) SQLite To do it with the SQLite backend (which is the default backend) we need to revise the SQL statement appropriately. Be sure that RPostgreSQL is NOT loaded before doing this. We have assumed that the data is already sorted by Date within each Category based on the data shown in the question but if that were not the case it would be easy enough to extend the SQL to sort it first.

library(sqldf)

sqldf("select a.*, count(*) seq 
       from DF a left join DF b on a.Category = b.Category and b.rowid >= a.rowid 
       group by a.rowid 
       order by a.Category, a.Date")

Note

The input DF in reproducible form is:

Lines <- "
Category  Value  Date    
apples        3  2018-07-01 
apples        2  2018-07-02 
apples        1  2018-07-03 
bananas       9  2018-07-01 
bananas       8  2018-07-02 
bananas       7  2018-07-03 
"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Would you explain where and how `RPostgreSQL` library is used in your first answer? How does `sqldf` know that it has to read from DF which is previously digested by `RPostgreSQL`? – Przemyslaw Remin Jul 24 '18 at 08:49
  • Are you sure that in your second answer the `library(sqldf)` should be used instead of `library(RSQLite)`? – Przemyslaw Remin Jul 24 '18 at 09:03
  • See the description of the sqldf `drv=` argument in `?sqldf` which describes how it determines the backend when `drv=` is not specified. Also see FAQ #12 on the sqldf github home page: https://github.com/ggrothendieck/sqldf – G. Grothendieck Jul 24 '18 at 11:40
  • I keep getting errors for the first answer `Error in postgresqlNewConnection(drv, ...) : RS-DBI driver: (could not connect postgres@localhost:5432 on dbname "test": could not connect to server: Connection refused (0x0000274D/10061)` The second answer works however it gives wrong results if two dates are equal. I expect results as ROW_NUMBER which return 1, 2, 3 even if three values per group are equal. I give up this SQL-like approach and move to `dplyr` solution. Thank you. – Przemyslaw Remin Jul 24 '18 at 12:20
  • The error message says it could not connect to the test database on the PostgreSQL server. Did you install PostgreSQL? Did you start the server after installing it? Did you create a test database? Did you read the material in my last comment? The example in the question suggests sequential unique dates. Please improve the question to accurately portray the problem if that is not the case. In the meantime I have modified the SQL in (2) to use rowid instead of Date assuming that it is already sorted by Date within category which seems to be the case based on your example. – G. Grothendieck Jul 24 '18 at 12:43
  • I like your second answer now. Works, accepted. The first answer is too much ado. I have learned from this post that dplyr and other more r-native solutions are far superior to sql-like solutions. Hat off my head. – Przemyslaw Remin Jul 24 '18 at 12:53
  • 1
    Your problems have nothing to do with sqldf and dplyr. Those are just R front ends. If you want to use PostgreSQL as a backend then you still have to install it, start it, configure it, etc. no matter what R front end you use. Also if you want to use SQLite then you won't be able to use the SQL code you show in the question no matter what front end you use (however, there is currently work going on in SQLite to support such functionality -- see http://www.sqlite.org/draft/windowfunctions.html ) – G. Grothendieck Jul 24 '18 at 13:06
  • 1
    I have just realized that you are the author. Face palm! Respect. – Przemyslaw Remin Jul 24 '18 at 13:33
  • @PrzemyslawRemin Thought it was clear from [my comment above](https://stackoverflow.com/questions/51475444/index-by-category-with-sorting-by-column-in-r-sqldf-package?noredirect=1#comment89940682_51475444) :) – zx8754 Jul 24 '18 at 13:49