29

Suppose we have the following data. Each row represents a country, and the columns (in05:in09) indicate whether that country was present in a database of interest in the given year (2005:2009).

id <- c("a", "b", "c", "d")
in05 <- c(1, 0, 0, 1)
in06 <- c(0, 0, 0, 1)
in07 <- c(1, 1, 0, 1)
in08 <- c(0, 1, 1, 1)
in09 <- c(0, 0, 0, 1)
df <- data.frame(id, in05, in06, in07, in08, in09)

I want to create a variable firstyear which indicates the first year in which the country was present in the database. Right now I do the following:

df$firstyear <- ifelse(df$in05==1,2005,
    ifelse(df$in06==1,2006,
        ifelse(df$in07==1, 2007,
            ifelse(df$in08==1, 2008,
                ifelse(df$in09==1, 2009,
                    0)))))

The above code is already not very nice, and my dataset contains many more years. Is there an alternative, using *apply functions, loops or something else, to create this firstyear variable?

Kat

7 Answers

25

You can vectorize this using `max.col`:

indx <- names(df)[max.col(df[-1], ties.method = "first") + 1L]
df$firstyear <- as.numeric(sub("in", "20", indx))
df
#   id in05 in06 in07 in08 in09 firstyear
# 1  a    1    0    1    0    0      2005
# 2  b    0    0    1    1    0      2007
# 3  c    0    0    0    1    0      2008
# 4  d    1    1    1    1    1      2005
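One caveat worth flagging (my addition, not part of the original answer): with `ties.method = "first"`, a row of all zeros ties at the maximum 0 in every column, so `max.col` returns column 1 and the row would be labelled 2005, whereas the question's nested `ifelse` returns 0. A minimal guard, assuming the indicator columns are everything but `id`:

no_year <- rowSums(df[-1]) == 0  # rows never present in the database
df$firstyear[no_year] <- 0       # restore the question's 0 fallback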
David Arenburg
  • Good old `max.col` - always comes to the rescue. Though it's really quite annoying that it defaults to `"random"` for dealing with ties, considering `which.max` / `which.min` etc. always take the first result they hit. – thelatemail May 27 '15 at 23:13
21
df$FirstYear <- gsub('in', '20', names(df))[apply(df, 1, match, x=1)]
df
  id in05 in06 in07 in08 in09 FirstYear
1  a    1    0    1    0    0      2005
2  b    0    0    1    1    0      2007
3  c    0    0    0    1    0      2008
4  d    1    1    1    1    1      2005

There are many ways to do it. I used match because it finds the first instance of a specified value. The other parts of the code are for presentation: apply goes row by row, and names pulls the years from the column names. The assignment df$FirstYear <- adds the result to the data frame.

Added credit: @DavidArenburg had the cool idea of subbing the "in" for "20" in the FirstYear column.
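A minimal illustration of the first-instance behaviour that makes match fit here (my example, not from the answer):

match(1, c(0, 1, 1, 0))
#[1] 2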

Pierre L
8

Another answer, with some notes on efficiency (although this Q&A is not about speed).

Firstly, it can be better to avoid converting a "list"-y structure to a "matrix"; sometimes it's worth converting to a "matrix" and using a function that efficiently handles a 'vector with a "dim" attribute' (i.e. a "matrix"/"array"), and other times it's not. Both max.col and apply convert to a "matrix".

Secondly, in situations like this, where we do not need to check all the data to reach a solution, we can benefit from a loop that controls what goes through to the next iteration. Here we know we can stop as soon as we've found the first "1". Both max.col and which.max have to make a full pass to actually find the maximum value; the fact that we know "max == 1" is not taken advantage of.

Thirdly, match is potentially slower when we seek only one value in another vector of values, because match's setup is rather complicated and costly:

x = 5; set.seed(199); tab = sample(1e6)
identical(match(x, tab), which.max(x == tab))
#[1] TRUE
microbenchmark::microbenchmark(match(x, tab), which.max(x == tab), times = 25)
#Unit: milliseconds
#                expr       min        lq    median        uq       max neval
#       match(x, tab) 142.22327 142.50103 142.79737 143.19547 145.37669    25
# which.max(x == tab)  18.91427  18.93728  18.96225  19.58932  38.34253    25

To sum up, a way to work on the "list" structure of a "data.frame" and to stop computing when we find a "1" is a loop like the following:

ff = function(x)
{
    x = as.list(x)
    # start from the first column: 1 where present, 0 where not
    ans = as.integer(x[[1]])
    for(i in 2:length(x)) {
        inds = ans == 0L            # rows still without a first year
        if(!any(inds)) return(ans)  # early exit: every row is resolved
        ans[inds] = i * (x[[i]][inds] == 1)
    }
    return(ans)
}

And the solutions in the other answers (ignoring the extra steps for the output):

david = function(x) max.col(x, "first")
plafort = function(x) apply(x, 1, match, x = 1)

ff(df[-1])
#[1] 1 3 4 1
david(df[-1])
#[1] 1 3 4 1
plafort(df[-1])
#[1] 1 3 4 1
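To turn these column indices back into years (a sketch of the "extra steps" skipped above; the 2004 offset is my assumption based on the question's in05:in09 layout):

idx <- ff(df[-1])
df$firstyear <- ifelse(idx == 0, 0, 2004 + idx)  # keep 0 for all-zero rows
df$firstyear
#[1] 2005 2007 2008 2005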

And some benchmarks:

set.seed(007)
DF = data.frame(id = seq_len(1e6),
                "colnames<-"(matrix(sample(0:1, 1e7, T, c(0.25, 0.75)), 1e6), 
                             paste("in", 11:20, sep = "")))
identical(ff(DF[-1]), david(DF[-1]))
#[1] TRUE
identical(ff(DF[-1]), plafort(DF[-1]))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF[-1]), david(DF[-1]), as.matrix(DF[-1]), times = 30)
#Unit: milliseconds
#              expr       min        lq    median        uq       max neval
#        ff(DF[-1])  64.83577  65.45432  67.87486  70.32073  86.72838    30
#     david(DF[-1]) 112.74108 115.12361 120.16118 132.04803 145.45819    30
# as.matrix(DF[-1])  20.87947  22.01819  27.52460  32.60509  45.84561    30

system.time(plafort(DF[-1]))
#   user  system elapsed 
#  4.117   0.000   4.125 

Not really an apocalypse, but it's worth seeing that simple, straightforward algorithmic approaches can indeed prove equally good or even better, depending on the problem. Obviously, (most) other times looping in R can be laborious.

alexis_laz
  • Brilliant. As always... Long time ago I was writing very efficient loops on lists and my code was blazing fast, but SO ruined me with the "anti-loop" philosophy :) – David Arenburg Jun 22 '15 at 16:24
  • @DavidArenburg: Looping is a way of life - you can hide it, 'vectorize' it, but you can't avoid it.. :-) – alexis_laz Jun 23 '15 at 19:54
  • Nice answer; you might be interested in the commentary in my answer. – BrodieG Jun 25 '15 at 13:34
8

You can use dplyr::case_when inside dplyr::mutate() along the lines of the method presented in this tweet.

# Using version 0.5.0.
# Dev version may work without `with()`.    
df %>%
      mutate(., firstyear = with(., case_when(
        in05 == 1 ~ 2005,
        in06 == 1 ~ 2006,
        in07 == 1 ~ 2007,
        in08 == 1 ~ 2008,
        in09 == 1 ~ 2009,
        TRUE ~ 0
)))
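For what it's worth, with current dplyr (0.7.0 or later, as far as I know), case_when works directly inside mutate(), so the with() wrapper can be dropped:

library(dplyr)
df %>%
  mutate(firstyear = case_when(
    in05 == 1 ~ 2005,
    in06 == 1 ~ 2006,
    in07 == 1 ~ 2007,
    in08 == 1 ~ 2008,
    in09 == 1 ~ 2009,
    TRUE ~ 0
  ))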
seasmith
4

Here is another option:

years <- as.integer(substr(names(df[-1]), 3, 4)) + 2000L
cbind(df, yr=do.call(pmin.int, Map(`/`, years, df[-1])))

Produces:

  id in05 in06 in07 in08 in09   yr
1  a    1    0    1    0    0 2005
2  b    0    0    1    1    0 2007
3  c    0    0    0    1    0 2008
4  d    1    1    1    1    1 2005
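To see why the division trick works (illustration mine, not from the answer): dividing a year by its 0/1 indicator leaves the year where the country is present and Inf where it is absent, so the parallel minimum recovers the earliest present year.

Map(`/`, c(2005, 2006), list(c(1, 0), c(0, 1)))
#[[1]]
#[1] 2005  Inf
#
#[[2]]
#[1]  Inf 2006
do.call(pmin.int, Map(`/`, c(2005, 2006), list(c(1, 0), c(0, 1))))
#[1] 2005 2006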

And it is fast. Here I time only the find-the-min-year step, using Alexis' data:

Unit: milliseconds
                                       expr       min       lq   median       uq      max neval
 do.call(pmin.int, Map(`/`, 11:20, DF[-1])) 178.46993 194.3760 219.8898 229.1597 307.1120    10
                                 ff(DF[-1]) 416.07297 434.0792 439.1970 452.8345 496.2048    10
                   max.col(DF[-1], "first")  99.71936 138.2285 175.2334 207.6365 239.6519    10

Oddly, this doesn't reproduce Alexis' timings: here David's comes out fastest. This is on R 3.1.2.


EDIT: based on a conversation with Frank, I updated Alexis' function to be more compatible with R 3.1.2:

ff2 = function(x) {
  ans = as.integer(x[[1]])
  for(i in 2:length(x)) {
      inds = which(ans == 0L)
      if(!length(inds)) return(ans)
      ans[inds] = i * (x[[i]][inds] == 1)
  }
  return(ans)
}
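A quick sanity check (mine, not from the post): ff2 swaps the logical mask for which()-derived integer indices but keeps the same early-exit logic, so the two functions agree:

identical(ff(df[-1]), ff2(df[-1]))
#[1] TRUE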

And this comes closer to the original results:

Unit: milliseconds
        expr       min        lq    median        uq      max neval
  ff(DF[-1]) 407.92699 415.11716 421.18274 428.02092 462.2474    10
 ff2(DF[-1])  64.20484  72.74729  79.85748  81.29153 148.6439    10
BrodieG
  • Interesting. Maybe an R version thing. When I run `microbenchmark(do.call(pmin.int, Map(\`/\`, 11:20, DF[-1])),ff(DF[-1]),max.col(DF[-1], "first"),times=10)` on R 3.2.0 with Alexis' example data I get Alexis 150, Brodie 275, David 430 (for the mean or median). – Frank Jun 25 '15 at 12:50
  • @Frank Hmm, I guess I'll check this again when I upgrade, though the truly mystifying thing is that `max.col` gets _slower_. – BrodieG Jun 25 '15 at 13:13
  • @Frank, I have a theory as to why `Alexis` is faster. I think R 3.2.0 is smarter about `x[logical]` vs `x[which(logical)]`. The latter is traditionally much faster. On my system `x <- logical(1e5); x[sample(1e5, 1e4)] <- TRUE; microbenchmark(x[which(x)], x[x])` is 8x faster for the `which` version. Can you run that on yours? – BrodieG Jun 25 '15 at 13:26
  • Yeah, on mine I have multiple versions of R. 3.2.0 -- 2x as fast for `which`; 3.0.1 -- 3-4x – Frank Jun 25 '15 at 13:29
  • On R-3.2.0, I get `ff: 100 ms | ff2: 65 ms`; on R-3.1.2, `ff: 230 ms | ff2: 75ms`. I'm not able to get my initial 65 ms for ff. For the `x[logical]` VS `x[which(logical)]`, `x[which(x)]` is 3 ms on both versions of R, but `x[x]` is 5 ms on R-3.2.0 VS 12 ms on R-3.1.2. Looking at [R-3.1.2's `logicalSubscript`](http://svn.r-project.org/R/tags/R-3-1-2/src/main/subscript.c) VS [R-3.2.0's `logicalSubscript`](http://svn.r-project.org/R/tags/R-3-2-0/src/main/subscript.c), it seems that R-3.2.0 avoids extended use of "%" in returning integer indices from the logical ones (to be -subsequently- used). – alexis_laz Jun 26 '15 at 11:40
2

I always prefer to work with tidied data. The first method filters on cumulative sums.

# Tidy (gather comes from tidyr; the pipeline also needs dplyr)
library(tidyr)
library(dplyr)

df <- df %>% 
  gather(year, present.or.not, -id) 

# Create df of first instances
first.df <- df %>% 
  group_by(id, present.or.not) %>% 
  mutate(ranky = rank(cumsum(present.or.not)), 
         first.year = year) %>% 
  filter(ranky == 1)

# Prepare for join
first.df <- first.df[,c('id', 'first.year')]

# Join with original
df <- left_join(df,first.df)

# Spread
spread(df, year, present.or.not)

Or this alternative that, after tidying, slices the first row from arranged groups.

library(stringr)  # for str_replace

df %>% 
  gather(year, present_or_not, -id) %>% 
  filter(present_or_not == 1) %>% 
  group_by(id) %>% 
  arrange(id, year) %>% 
  slice(1) %>% 
  mutate(year = str_replace(year, "in", "20")) %>% 
  select(1:2) %>% 
  right_join(df)
Nettle
0

Other messy alternatives:

library(tidyr)
library(sqldf)

newdf <- gather(df, year, code, -id)
df$firstyear <- sqldf('SELECT min(rowid) rowid, id, year as firstyear
                       FROM newdf
                       WHERE code = 1
                       GROUP BY id')[3]

library(tidyr)
df2 <- gather(df, year, code, -id)
df2 <- df2[df2$code == 1, 1:2]
df2 <- df2[!duplicated(df2$id), ]
merge(df, df2)

library(tidyr)
library(dplyr)

newdf <- gather(df, year, code, -id)
df$firstyear <- (newdf %>% 
                   filter(code == 1) %>%
                   select(id, year) %>%
                   group_by(id) %>%
                   summarise(first = first(year)))[2]

Output:

  id in05 in06 in07 in08 in09 year
1  a    1    0    1    0    0 in05
2  b    0    0    1    1    0 in07
3  c    0    0    0    1    0 in08
4  d    1    1    1    1    1 in05

A cleaner solution, combining plafort's solution with alexis_laz's, is:

names(df) <- c("id", 2005, 2006, 2007, 2008, 2009)
df$firstyear <- names(df[-1])[apply(df[-1], 1, which.max)] 

  id 2005 2006 2007 2008 2009 firstyear
1  a    1    0    1    0    0      2005
2  b    0    0    1    1    0      2007
3  c    0    0    0    1    0      2008
4  d    1    1    1    1    1      2005

If we'd like to keep the original column names, we can use the renaming provided by @DavidArenburg.

df$firstYear <- gsub('in', '20', names(df[-1]))[apply(df[-1], 1, which.max)]

  id in05 in06 in07 in08 in09 firstYear
1  a    1    0    1    0    0      2005
2  b    0    0    1    1    0      2007
3  c    0    0    0    1    0      2008
4  d    1    1    1    1    1      2005
mpalanco