Returning the Date of First and Last Occurrence of a Factor Variable

Question

Problem

I have a data frame where each row marks an exchange between companies, where companies give and receive something on a given date (they can give to a different company or to themselves). From that, I'd like to create a new data frame with columns indicating when a company first started giving, when it first stopped giving, when it first started receiving, and when it first stopped receiving. Here is a sample data frame of what I'm starting with:

Sample Starting Data

samp <- structure(list(giver = structure(c(1L, 2L, 6L, 3L, 1L, 3L, 4L, 1L, 6L, 1L, 5L), .Label = c("A", "B", "C", "X", "Y", "Z"), class = "factor"), receiver = structure(c(1L, 2L, 2L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"), date = structure(c(1L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 9L), .Label = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-06", "2000-01-07", "2000-01-08", "2000-01-09"), class = "factor")), .Names = c("giver", "receiver", "date"), class = "data.frame", row.names = c(NA, -11L))
samp$date <- as.Date(samp$date, "%Y-%m-%d") # Format date variable

samp 
giver receiver   date
A        A     2000-01-01
B        B     2000-01-01
Z        B     2000-01-02
C        C     2000-01-03
A        A     2000-01-04
C        C     2000-01-05
X        C     2000-01-06
A        A     2000-01-07
Z        B     2000-01-08
A        A     2000-01-09
Y        B     2000-01-09

However, I am having trouble figuring out how to scan one column for the first and last occurrences of each company and return the date value of a different column. I found similar questions here and here using match, duplicated, or tapply but can't quite get them to fit with what I am trying to do. Here is a sample data frame of what I'm hoping to end up with:

Desired Ending Data

desire <- structure(list(company = structure(1:6, .Label = c("A", "B", "C", "X", "Y", "Z"), class = "factor"), start.giving = structure(c(1L, 1L, 3L, 4L, 5L, 2L), .Label = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05", "2000-01-09"), class = "factor"), stop.giving = structure(c(5L, 1L, 2L, 3L, 5L, 4L), .Label = c("2000-01-01", "2000-01-05", "2000-01-06", "2000-01-08", "2000-01-09"), class = "factor"), start.receiving = structure(c(1L, 1L, 2L, NA, NA, NA), .Label = c("2000-01-01", "2000-01-03"), class = "factor"), stop.receiving = structure(c(2L, 2L, 1L, NA, NA, NA), .Label = c("2000-01-06", "2000-01-09"), class = "factor")), .Names = c("company", "start.giving", "stop.giving", "start.receiving", "stop.receiving"), class = "data.frame", row.names = c(NA, -6L))

desire 
company start.giving stop.giving start.receiving stop.receiving
A       2000-01-01   2000-01-09      2000-01-01     2000-01-09
B       2000-01-01   2000-01-01      2000-01-01     2000-01-09
C       2000-01-03   2000-01-05      2000-01-03     2000-01-06
X       2000-01-05   2000-01-06            <NA>           <NA>
Y       2000-01-09   2000-01-09            <NA>           <NA>
Z       2000-01-02   2000-01-08            <NA>           <NA>

score 4 · Accepted Answer · edited Apr 22 '15 at 18:01

Here's a bit simplified version using the data.table package

library(data.table)
setDT(samp)
Res1 <- samp[, .(start = min(date), stop = max(date)), by = .(company = giver)]
Res2 <- samp[, .(start = min(date), stop = max(date)), by = .(company = receiver)]
merge(Res1, Res2, by = "company", all = TRUE, suffixes = c(".giving", ".receiving"))
#    company start.giving stop.giving start.receiving stop.receiving
# 1:       A   2000-01-01  2000-01-09      2000-01-01     2000-01-09
# 2:       B   2000-01-01  2000-01-01      2000-01-01     2000-01-09
# 3:       C   2000-01-03  2000-01-05      2000-01-03     2000-01-06
# 4:       X   2000-01-06  2000-01-06            <NA>           <NA>
# 5:       Y   2000-01-09  2000-01-09            <NA>           <NA>
# 6:       Z   2000-01-02  2000-01-08            <NA>           <NA>

Ben Bolker · Answer 2 · 2015-04-22T21:06:11.983

The dplyr version:

library("dplyr")
giving <- samp %>% group_by(giver) %>%
    summarise(start.giving=min(date),
              stop.giving=max(date)) %>%
       rename(company=giver)
receiving <- samp %>% group_by(receiver) %>%
    summarise(start.receiving=min(date),
              stop.receiving=max(date)) %>%
       rename(company=receiver)
full_join(giving,receiving)

With a little more work it's possible to condense this still further/not repeat all of the summarise code (analogous to the foo() function in @Arun's answer) ...

foo <- function(x,f) {
    ss <- c("start","stop")
    group_by_(x,.dots=f) %>%
       summarise(start=min(date),
                 stop=max(date)) %>%
       rename_(.dots=c(company=f,
                       setNames(ss,paste(ss,f,sep="."))))
 }
 full_join(foo(samp,"giver"),foo(samp,"receiver"))

... although the code is now less transparent, and not actually any shorter ... it would be worth it if you were going to do this sort of thing a lot.

In the creation of the `receiving` object, I think you have an extra parenthesis after `(start.receiving=min(date)),`. It should be `(start.receiving=min(date),`, right? Also, for some reason I get an error with the `full_join()` command: "Error: could not find function 'full_join'" even though clearly `dplyr` is loaded. — coip, Apr 22 '15 at 20:40
Thx for correction. The names of the join/merge functions have been rather fluid in recent versions of `dplyr`: https://github.com/hadley/dplyr/issues/797 . You might try `outer_join()` instead. — Ben Bolker, Apr 22 '15 at 20:56

Arun · Answer 3 · 2015-04-22T18:25:37.520

Using devel version of data.table, 1.9.5, here's another version using the new feature of dcast:

require(data.table) ## v1.9.5+
foo <- function(x, col) {
    ans <- dcast(x, paste(col, "~ ."), value.var="date", fun=list(min, max))
    setnames(ans, c("company", "start", "stop"))
}
setDT(samp)
merge(foo(samp, "giver"), foo(samp, "receiver"), by = "company", 
        all=TRUE, suffixes=c(".giving", ".receiving"))
#    company start.giving stop.giving start.receiving stop.receiving
# 1:       A   2000-01-01  2000-01-09      2000-01-01     2000-01-09
# 2:       B   2000-01-01  2000-01-01      2000-01-01     2000-01-09
# 3:       C   2000-01-03  2000-01-05      2000-01-03     2000-01-06
# 4:       X   2000-01-06  2000-01-06            <NA>           <NA>
# 5:       Y   2000-01-09  2000-01-09            <NA>           <NA>
# 6:       Z   2000-01-02  2000-01-08            <NA>           <NA>

coip · Answer 4 · 2015-04-23T02:18:59.627

This can be done in base R with the aggregate and merge commands:

# Import starting sample data
samp <- structure(list(giver = structure(c(1L, 2L, 6L, 3L, 1L, 3L, 4L, 1L, 6L, 1L, 5L), .Label = c("A", "B", "C", "X", "Y", "Z"), class = "factor"), receiver = structure(c(1L, 2L, 2L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"), date = structure(c(1L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 9L), .Label = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-06", "2000-01-07", "2000-01-08", "2000-01-09"), class = "factor")), .Names = c("giver", "receiver", "date"), class = "data.frame", row.names = c(NA, -11L))
samp$date <- as.Date(samp$date, "%Y-%m-%d") # Format date variable

# Find first and last occurrence by date
g1 <- aggregate(samp$date, list(samp$giver), min)
colnames(g1)[1] = "company"
colnames(g1)[2] = "start.giving"

g2 <- aggregate(samp$date, list(samp$giver), max)
colnames(g2)[1] = "company"
colnames(g2)[2] = "stop.giving"

s1 <- aggregate(samp$date, list(samp$receiver), min)
colnames(s1)[1] = "company"
colnames(s1)[2] = "start.receiving"

s2 <- aggregate(samp$date, list(samp$receiver), max)
colnames(s2)[1] = "company"
colnames(s2)[2] = "stop.receiving"

# Merge data frames by company name
a1 <- merge(g1, g2, by=c("company"))
b1 <- merge(s1, s2, by=c("company"))
c1 <- merge(a1, b1, by=c("company"), all.x = TRUE)

c1 # Display desired data frame
company start.giving stop.giving start.receiving stop.receiving
 A       2000-01-01  2000-01-09     2000-01-01     2000-01-09
 B       2000-01-01  2000-01-01     2000-01-01     2000-01-09
 C       2000-01-03  2000-01-05     2000-01-03     2000-01-06
 X       2000-01-06  2000-01-06         <NA>           <NA>
 Y       2000-01-09  2000-01-09         <NA>           <NA>
 Z       2000-01-02  2000-01-08         <NA>           <NA>

Returning the Date of First and Last Occurrence of a Factor Variable

4 Answers4