10

I have a data set that looks like this

id  name    year    job    job2
1   Jane    1980    Worker  0
1   Jane    1981    Manager 1
1   Jane    1982    Manager 1
1   Jane    1983    Manager 1
1   Jane    1984    Manager 1
1   Jane    1985    Manager 1
1   Jane    1986    Boss    0
1   Jane    1987    Boss    0
2   Bob     1985    Worker  0
2   Bob     1986    Worker  0
2   Bob     1987    Manager 1
2   Bob     1988    Boss    0
2   Bob     1989    Boss    0
2   Bob     1990    Boss    0
2   Bob     1991    Boss    0
2   Bob     1992    Boss    0

Here, job2 denotes a dummy variable indicating whether a person was a Manager during that year or not. I want to do two things to this data set: first, I only want to preserve the row when the person became Boss for the first time. Second, I would like to see cumulative years a person worked as a Manager and store this information in the variable cumu_job2. Thus I would like to have:

id  name    year    job    job2 cumu_job2
1   Jane    1980    Worker  0   0
1   Jane    1981    Manager 1   1
1   Jane    1982    Manager 1   2
1   Jane    1983    Manager 1   3
1   Jane    1984    Manager 1   4
1   Jane    1985    Manager 1   5
1   Jane    1986    Boss    0   0
2   Bob     1985    Worker  0   0
2   Bob     1986    Worker  0   0
2   Bob     1987    Manager 1   1
2   Bob     1988    Boss    0   0

I have changed my examples and included the Worker position because this reflects more what I want to do with the original data set. The answers in this thread only works when there are only Managers and Boss in the data set - so any suggestions for making this work would be great. I'll be very much grateful!!

halo09876
  • 2,725
  • 12
  • 51
  • 71

5 Answers5

22

Here is the succinct dplyr solution for the same problem.

NOTE: Make sure that stringsAsFactors = FALSE while reading in the data.

library(dplyr)
dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))

Output:

   id name year     job job2 cumu_job2
1   1 Jane 1980  Worker    0         0
2   1 Jane 1981 Manager    1         1
3   1 Jane 1982 Manager    1         2
4   1 Jane 1983 Manager    1         3
5   1 Jane 1984 Manager    1         4
6   1 Jane 1985 Manager    1         5
7   1 Jane 1986    Boss    0         0
8   2  Bob 1985  Worker    0         0
9   2  Bob 1986  Worker    0         0
10  2  Bob 1987 Manager    1         1
11  2  Bob 1988    Boss    0         0

Explanation

  1. Take the dataset
  2. Group by name and job
  3. Filter each group based on condition
  4. Add cumu_job2 column.
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Ramnath
  • 54,439
  • 16
  • 125
  • 152
  • 5
    The same idea on filtering can be used with `data.table`: `dt[, list(cum_job2=cumsum(job2[job!="Boss" | year==min(year)])), by=c('name', 'job')]` – Jean-Robert Jan 29 '14 at 08:24
  • @Ramnath I would like to know why this doesn't work for me -- i can't install dplyr and %.% is not a function. – halo09876 Feb 05 '14 at 13:02
  • `%.%` is a function in `dplyr`. `dplyr` is on CRAN, so it should be straightforward to install using `install_packages`. – Ramnath Feb 05 '14 at 14:13
  • @Ramnath Warning in install.packages : package ‘dplyr’ is not available (for R version 3.0.0) Which R version are you working on? – halo09876 Feb 05 '14 at 18:14
  • @Ramnath I also tried devtools::install_github("hadley/dplyr") but it says client error: 404 not found – halo09876 Feb 05 '14 at 18:24
  • So I updated R but this did not work - it just created cumulation of job2 until the very end (regardless of each observation) – halo09876 Feb 06 '14 at 22:41
  • I dont understand what output you got. One thing to be careful is to make sure that `plyr` is not loaded at the same time as `dplyr`, since it could lead to some conflicts. I have updated my answer with the output, which I believe is what you were looking for. – Ramnath Feb 06 '14 at 23:16
11

Contributed by Matthew Dowle:

dt[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
     by = list(name, job)]

Explanation

  1. Take the dataset
  2. Run a filter and add a column within each Subset of Data (.SD)
  3. Grouped by name and job

Older versions:

You have two different split apply combines here. One to get the cumulative jobs, and the other to get the first row of boss status. Here is an implementation in data.table where we basically do each analysis separately (well, kind of), and then collect everything in one place with rbind. The main thing to note is the by=id piece, which basically means the other expressions are evaluated for each id grouping in the data, which was what you correctly noted was missing from your attempt.

library(data.table)
dt <- as.data.table(df)
dt[, cumujob:=0L]  # add column, set to zero
dt[job2==1, cumujob:=cumsum(job2), by=id]  # cumsum for manager time by person 
rbind(
  dt[job2==1],                     # this is just the manager portion of the data
  dt[job2==0, head(.SD, 1), by=id] # get first bossdom row
)[order(id, year)]                 # order by id, year
#       id name year     job job2 cumujob
#   1:  1 Jane 1980 Manager    1       1
#   2:  1 Jane 1981 Manager    1       2
#   3:  1 Jane 1982 Manager    1       3
#   4:  1 Jane 1983 Manager    1       4
#   5:  1 Jane 1984 Manager    1       5
#   6:  1 Jane 1985 Manager    1       6
#   7:  1 Jane 1986    Boss    0       0
#   8:  2  Bob 1985 Manager    1       1
#   9:  2  Bob 1986 Manager    1       2
#  10:  2  Bob 1987 Manager    1       3
#  11:  2  Bob 1988    Boss    0       0

Note this assumes table is sorted by year within each id, but if it isn't that's easy enough to fix.


Alternatively you could also achieve the same with:

ans <- dt[, .I[job != "Boss" | year == min(year)], by=list(name, job)]
ans <- dt[ans$V1]
ans[, cumujob := cumsum(job2), by=list(name,job)] 

The idea is to basically get the row numbers where the condition matches (with .I - internal variable) and then subset dt on those row numbers (the $v1 part), then just perform the cumulative sum.

BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • Thank you so much! I have one error though - Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1) – halo09876 Jan 29 '14 at 03:33
  • I don't really understand this because I did turn the id and job2 vectors into integer vectors through as.integer command.... – halo09876 Jan 29 '14 at 03:34
  • 1
    I read http://stackoverflow.com/questions/16361225/unable-to-perform-calculations-using-r-data-table-package and solved the problem - simply did cumujob:=as.numeric(cumsum(job2)) instead. – halo09876 Jan 29 '14 at 03:42
  • 1
    As far as readability goes, I'd go with: `dt[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], by = list(name, job)]` – eddi Jan 29 '14 at 16:18
  • @eddi Hi -- I have one more question! So if I have other positions than Managers, how would I be able to preserve all information? I haven't well specified this part in my question, but I think dt[job2==1] doing this is throwing a lot of information away in my data set. – halo09876 Feb 05 '14 at 13:10
  • For the last code I'm getting Error in `[.data.frame`(exemptacc, , .SD[ccmem == 0 | year == min(year)][, : unused argument (by = list(name, prov.1)) – halo09876 Feb 05 '14 at 13:18
  • THe second code causes a same error too -- unused argument (by = list(name1, ccmem)) – halo09876 Feb 05 '14 at 13:34
  • @Rusuer9000, make sure you add the line `library(data.table)` before you run that line. The last one works fine for me. I haven't tested the second one. If the package is not installed, then you will need to run `install.packages("data.table")` first. – BrodieG Feb 05 '14 at 13:43
  • @Rusuer9000 you already have that information in the data, so instead of thinking of how to represent that information as one number, how about skipping that part, and going straight to the next step of what you want to do. – eddi Feb 05 '14 at 16:13
3

Here is a base solution using within and ave. We assume that the input is DF and that the data is sorted as in the question.

DF2 <- within(DF, {
    seq = ave(id, id, job, FUN = seq_along)
    job2 = (job == "Manager") + 0
    cumu_job2 = ave(job2, id, job, FUN = cumsum)
})
subset(DF2, job != 'Boss' | seq == 1, select = - seq)

REVISION: Now uses within.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
1

I think this does what you want, although the data must be sorted as you have presented it.

my.df <- read.table(text = '
id  name    year    job    job2
1   Jane    1980    Worker  0
1   Jane    1981    Manager 1
1   Jane    1982    Manager 1
1   Jane    1983    Manager 1
1   Jane    1984    Manager 1
1   Jane    1985    Manager 1
1   Jane    1986    Boss    0
1   Jane    1987    Boss    0
2   Bob     1985    Worker  0
2   Bob     1986    Worker  0
2   Bob     1987    Manager 1
2   Bob     1988    Boss    0
2   Bob     1989    Boss    0
2   Bob     1990    Boss    0
2   Bob     1991    Boss    0
2   Bob     1992    Boss    0
', header = TRUE, stringsAsFactors = FALSE)

my.seq <- data.frame(rle(my.df$job)$lengths)

my.df$cumu_job2 <- as.vector(unlist(apply(my.seq, 1, function(x) seq(1,x))))

my.df2 <- my.df[!(my.df$job=='Boss' & my.df$cumu_job2 != 1),]
my.df2$cumu_job2[my.df2$job != 'Manager'] <- 0

   id name year     job job2 cumu_job2
1   1 Jane 1980  Worker    0         0
2   1 Jane 1981 Manager    1         1
3   1 Jane 1982 Manager    1         2
4   1 Jane 1983 Manager    1         3
5   1 Jane 1984 Manager    1         4
6   1 Jane 1985 Manager    1         5
7   1 Jane 1986    Boss    0         0
9   2  Bob 1985  Worker    0         0
10  2  Bob 1986  Worker    0         0
11  2  Bob 1987 Manager    1         1
12  2  Bob 1988    Boss    0         0
Mark Miller
  • 12,483
  • 23
  • 78
  • 132
0

@BrodieG's is way better:

The Data

dat <- read.table(text="id  name    year    job    job2
1   Jane    1980    Manager 1
1   Jane    1981    Manager 1
1   Jane    1982    Manager 1
1   Jane    1983    Manager 1
1   Jane    1984    Manager 1
1   Jane    1985    Manager 1
1   Jane    1986    Boss    0
1   Jane    1987    Boss    0
2   Bob     1985    Manager 1
2   Bob     1986    Manager 1
2   Bob     1987    Manager 1
2   Bob     1988    Boss    0
2   Bob     1989    Boss    0
2   Bob     1990    Boss    0
2   Bob     1991    Boss    0
2   Bob     1992    Boss    0", header=TRUE)

#The code:

inds1 <- rle(dat$job2)
inds2 <- cumsum(inds1[[1]])[inds1[[2]] == 1] + 1

ends <- cumsum(inds1[[1]])
starts <- c(1, head(ends + 1, -1))
inds3 <- mapply(":", starts, ends)
dat$id <- rep(1:length(inds3), sapply(inds3, length))
dat <- do.call(rbind, lapply(split(dat[, 1:5], dat$id ), function(x) {
    if(x$job2[1] == 0){ 
        x$cumu_job2 <- rep(0, nrow(x))
    } else { 
        x$cumu_job2 <- 1:nrow(x)
    }
    x
}))


keeps <- dat$job2 > 0
keeps[inds2] <- TRUE
dat2 <- data.frame(dat[keeps, ], row.names = NULL)
dat2

##    id name year     job job2 cumu_job2
## 1   1 Jane 1980 Manager    1         1
## 2   1 Jane 1981 Manager    1         2
## 3   1 Jane 1982 Manager    1         3
## 4   1 Jane 1983 Manager    1         4
## 5   1 Jane 1984 Manager    1         5
## 6   1 Jane 1985 Manager    1         6
## 7   2 Jane 1986    Boss    0         0
## 8   3  Bob 1985 Manager    1         1
## 9   3  Bob 1986 Manager    1         2
## 10  3  Bob 1987 Manager    1         3
## 11  4  Bob 1988    Boss    0         0
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519