
I am trying to do something in R that is equivalent to a "first value" windowing function in SQL, such as:

select *, first_value(value3) over (partition by value1, value2 order by value5 desc)
from df

Is there a good way to do this in R without having to create a database to use something like sqldf/RPostgreSQL?

I've seen some other examples like the answers here: Cumulative sum by group in sqldf?

but I'm having some trouble figuring out how to do it with a function where the order within the window matters. Another consideration is that I'm working with ~500K rows, so performance is a concern.

Edit: here's an example:

item_id  rental_date   customer_id
I001     10/20/2012    1
I002     10/05/2012    4 
I001     10/15/2012    3
I004     10/19/2012    1
I001     10/11/2012    6
I002     9/15/2012     5
I004     10/13/2012    10
I001     9/30/2012     4

How would I determine the first customer to rent a given item for each month?

bobfet1
    Can you expand the description for those of us who aren't proficient in SQL? – Roman Luštrik Oct 30 '12 at 12:41
  • you can modify the answers in the linked question to return the first row. It will look something like `dt[, .SD[1,], by = yourKey]`. There are some good answers [here](http://stats.stackexchange.com/questions/7884/fast-ways-in-r-to-get-the-first-row-of-a-data-frame-grouped-by-an-identifier/7889#7889). You'll probably want to do some manipulation on the date column to extract the month. Do you have multiple years worth of data? i.e. need to analyze Jan 2011 separate from Jan 2012? Something to keep in mind... – Chase Oct 30 '12 at 13:07
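The data.table approach suggested in the comment above can be sketched roughly as follows (this is an illustrative sketch, not from the original post; it assumes the data.table package is installed, and `first_rentals` is just a name chosen here):

```r
library(data.table)

# The sample data from the question
dt <- data.table(
  item_id     = c("I001", "I002", "I001", "I004", "I001", "I002", "I004", "I001"),
  rental_date = as.Date(c("10/20/2012", "10/05/2012", "10/15/2012", "10/19/2012",
                          "10/11/2012", "9/15/2012", "10/13/2012", "9/30/2012"),
                        format = "%m/%d/%Y"),
  customer_id = c(1, 4, 3, 1, 6, 5, 10, 4)
)

# Year-month key, so Jan 2011 stays separate from Jan 2012
dt[, ym := format(rental_date, "%Y-%m")]

# Sort so the earliest rental comes first within each group,
# then take the first row of each (item_id, ym) group
setkey(dt, item_id, ym, rental_date)
first_rentals <- dt[, .SD[1], by = .(item_id, ym)]
first_rentals
```

This is essentially `dt[, .SD[1,], by = yourKey]` from the comment, with the key chosen so that the row order inside each group matches the window's `order by`.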

2 Answers


If by not using sqldf/RPostgreSQL you mean you are still willing to use sqldf, just with its default SQLite backend rather than PostgreSQL, then try the following. It relies on a feature added to SQLite within the last year: in a grouped query containing min or max, the other selected columns are guaranteed to come from the same row as the min or max.

Lines <- "item_id  rental_date   customer_id
I001     10/20/2012    1
I002     10/05/2012    4 
I001     10/15/2012    3
I004     10/19/2012    1
I001     10/11/2012    6
I002     9/15/2012     5
I004     10/13/2012    10
I001     9/30/2012     4"

DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
DF$rental_date <- as.Date(DF$rental_date, "%m/%d/%Y")
DF$ym <- format(DF$rental_date, "%Y-%m")

sqldf("select item_id, ym, customer_id, min(rental_date) rental_date
    from DF 
    group by item_id, ym")

The result in this case is:

  item_id      ym customer_id      rental_date
1    I001 2012-09           4       2012-09-30
2    I001 2012-10           6       2012-10-11
3    I002 2012-09           5       2012-09-15
4    I002 2012-10           4       2012-10-05
5    I004 2012-10          10       2012-10-13
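For completeness, the same result can be obtained in base R with no packages at all, by sorting on the date and keeping only the first row of each group (an editor's sketch, not from the original answer; `first_rentals` is an illustrative name):

```r
Lines <- "item_id  rental_date   customer_id
I001     10/20/2012    1
I002     10/05/2012    4
I001     10/15/2012    3
I004     10/19/2012    1
I001     10/11/2012    6
I002     9/15/2012     5
I004     10/13/2012    10
I001     9/30/2012     4"

DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
DF$rental_date <- as.Date(DF$rental_date, "%m/%d/%Y")
DF$ym <- format(DF$rental_date, "%Y-%m")

# Sort by date so the earliest rental comes first, then drop all
# but the first row of every (item_id, ym) combination
DF <- DF[order(DF$rental_date), ]
first_rentals <- DF[!duplicated(DF[c("item_id", "ym")]), ]
first_rentals
```

The `order` + `!duplicated` pair plays the role of the window's `order by` and `first_value`: within each partition, only the row that sorts first survives.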
G. Grothendieck

I assume the object in your example is a data.frame, which we'll call df.

library(plyr)
df$rental_date <- as.Date(df$rental_date, "%m/%d/%Y")
df$year  <- as.numeric(format(df$rental_date, "%Y"))
df$month <- months(df$rental_date)

ddply(df, c("item_id", "month", "year"), function(x) {
  x[min(x$rental_date) == x$rental_date, "customer_id", drop = FALSE]
})

The result should look like:

  item_id     month year customer_id
1    I001   October 2012           6
2    I001 September 2012           4
3    I002   October 2012           4
4    I002 September 2012           5
5    I004   October 2012          10
Beasterfield
  • Sorry, I just saw that you wanted the first customer per item and month. Fixed it. – Beasterfield Oct 30 '12 at 14:33
  • If they're concerned about performance, plyr is probably not the best solution. – Joshua Ulrich Oct 30 '12 at 14:54
  • @JoshuaUlrich in general that's true. But I would always sacrifice a couple of seconds of runtime for simple and readable code. Plus, there are still some tweaks which boost `ddply` as described here: http://stackoverflow.com/a/3685919/766828 – Beasterfield Oct 30 '12 at 15:02
  • But those "tweaks" make your code less simple and readable... and it depends on how often you have to run the code. Your `ddply` solution takes ~15s on my machine, while the answer by @G.Grothendieck takes 3s. That's with 500k rows, 100 items, 1000 days, and 1000 customers. – Joshua Ulrich Oct 30 '12 at 15:33
  • Ok, I don't have such a large data.frame at hand at the moment to measure with, but the tweaks I am talking about are `ddply( idata.frame(df), c("item_id", "month", "year"), .parallel=TRUE, ...` I don't see how they are less readable. But I must also admit that I didn't know about `sqldf`. – Beasterfield Oct 30 '12 at 16:36
  • Using `idata.frame` causes an error and `.parallel=TRUE` is much slower (using `doParallel` on Windows). :-S – Joshua Ulrich Oct 30 '12 at 16:53