147

I have a roster of employees, and I need to know at what department they are in most often. It is trivial to tabulate employee ID against department name, but it is trickier to return the department name, rather than the number of roster counts, from the frequency table. A simple example below (column names = departments, row names = employee ids).

DF <- matrix(sample(1:9,9),ncol=3,nrow=3)
DF <- as.data.frame.matrix(DF)
> DF
  V1 V2 V3
1  2  7  9
2  8  3  6
3  1  5  4

Now how do I get

> DF2
  RE
1 V3
2 V1
3 V2
zx8754
  • 52,746
  • 12
  • 114
  • 209
dmvianna
  • 15,088
  • 18
  • 77
  • 106

10 Answers10

142

One option using your data (for future reference, use set.seed() to make examples using sample reproducible):

DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))

colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"

A faster solution than using apply might be max.col:

colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"

...where ties.method can be any of "random" "first" or "last"

This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:

DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))

[[1]]
V2 V3 
 2  3 

[[2]]
V1 
 1 

[[3]]
V2 
 2 
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • If I have two equal columns I usually just pick the first. These are border cases which do not upset my statistical analysis. – dmvianna Jul 18 '13 at 23:59
  • 2
    @dmvianna - using `which.max` will be fine then. – thelatemail Jul 19 '13 at 00:03
  • I'm assuming the order is preserved, so I can create a new column with this vector that will align correctly to the employees IDs. Is that correct? – dmvianna Jul 19 '13 at 00:05
  • `apply` converts the `data.frame` to `matrix` internally. You may not see a performance difference on these dimensions though. – Arun Jul 19 '13 at 00:07
  • @dmvianna - yep, should do. `apply(x,1,function)` will just go through each row in turn starting at 1 and finishing at `nrow(x)`. – thelatemail Jul 19 '13 at 00:07
  • @thelatemail : what if i have to find the 2nd largest value's column name ? – Pankaj Kaundal Apr 19 '16 at 07:46
  • 2
    @PankajKaundal - assuming distinct values, how about this `colnames(DF)[max.col(replace(DF, cbind(seq_len(nrow(DF)), max.col(DF,ties.method="first")), -Inf), "first")]` – thelatemail Apr 19 '16 at 22:14
  • @thelatemail what if there are several rows in which all the values being compared are 0? how would you set a condition so that in this case, this field remains empty? – Pablo Herreros Cantis Jul 29 '20 at 19:15
  • @PabloHerrerosCantis - you might need to treat those separately - `rowSums(DF)==0` would be one way to flag them for replacement. – thelatemail Jul 29 '20 at 22:41
  • `apply(DF, 1, function(x) which(x == max(x))[1])` would presumably choose the first of equal maximum values. It does for your final example, though gives the column numbers rather than their names – Henry Mar 02 '23 at 15:23
20

One solution could be to reshape the date from wide to long putting all the departments in one column and counts in another, group by the employer id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.

library(tidyverse)

# sample data frame with a tie
df <- data_frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))

# If you aren't worried about ties:  
df %>% 
  rownames_to_column('id') %>%  # creates an ID number
  gather(dept, cnt, V1:V3) %>% 
  group_by(id) %>% 
  slice(which.max(cnt)) 

# A tibble: 3 x 3
# Groups:   id [3]
  id    dept    cnt
  <chr> <chr> <dbl>
1 1     V3       9.
2 2     V1       8.
3 3     V2       5.


# If you're worried about keeping ties:
df %>% 
  rownames_to_column('id') %>%
  gather(dept, cnt, V1:V3) %>% 
  group_by(id) %>% 
  filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
  arrange(id)

# A tibble: 4 x 3
# Groups:   id [3]
  id    dept    cnt
  <chr> <chr> <dbl>
1 1     V3       9.
2 2     V1       8.
3 3     V2       5.
4 3     V3       5.


# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>% 
  rownames_to_column('id') %>%
  gather(dept, cnt, V1:V3) %>% 
  group_by(id) %>% 
  mutate(dept_rank  = rank(-cnt, ties.method = "first")) %>% # or 'last'
  filter(dept_rank == 1) %>% 
  select(-dept_rank) 

# A tibble: 3 x 3
# Groups:   id [3]
  id    dept    cnt
  <chr> <chr> <dbl>
1 2     V1       8.
2 3     V2       5.
3 1     V3       9.

# if you wanted to keep the original wide data frame
df %>% 
  rownames_to_column('id') %>%
  left_join(
    df %>% 
      rownames_to_column('id') %>%
      gather(max_dept, max_cnt, V1:V3) %>% 
      group_by(id) %>% 
      slice(which.max(max_cnt)), 
    by = 'id'
  )

# A tibble: 3 x 6
  id       V1    V2    V3 max_dept max_cnt
  <chr> <dbl> <dbl> <dbl> <chr>      <dbl>
1 1        2.    7.    9. V3            9.
2 2        8.    3.    6. V1            8.
3 3        1.    5.    5. V2            5.
sbha
  • 9,802
  • 2
  • 74
  • 62
18

Based on the above suggestions, the following data.table solution worked very fast for me:

library(data.table)

set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))

system.time(
  DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#>    user  system elapsed 
#>    0.15    0.06    0.21
DT[]
#>          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#>       1:  7  4  1  2  3  7  6  6  6   1      V1
#>       2:  4  6  9 10  6  2  7  7  1   3      V4
#>       3:  3  4  9  8  9  9  8  8  6   7      V3
#>       4:  4  8  8  9  7  5  9  2  7   1      V4
#>       5:  4  3  9 10  2  7  9  6  6   9      V4
#>      ---                                       
#>  999996:  4  6 10  5  4  7  3  8  2   8      V3
#>  999997:  8  7  6  6  3 10  2  3 10   1      V6
#>  999998:  2  3  2  7  4  7  5  2  7   3      V4
#>  999999:  8 10  3  2  3  4  5  1  1   4      V2
#> 1000000: 10  4  2  6  6  2  8  4  7   4      V1

And also comes with the advantage that can always specify what columns .SD should consider by mentioning them in .SDcols:

DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]

In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD:

DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
Valentin_Ștefan
  • 6,130
  • 2
  • 45
  • 68
  • I had a similar requirement but want to get the column name having the minimum value for each row.....we don't seem to have min.col in R.....would you know what would be the equivalent solution? – user1412 Feb 26 '17 at 18:38
  • Hi @user1412. Thanks for your interesting question. I don't have any idea right now other than using the `which.min` in something that would look like: `DT[, MIN := colnames(.SD)[apply(.SD,1,which.min)]]` or `DT[, MIN2 := colnames(.SD)[which.min(.SD)], by = 1:nrow(DT)]` on the dummy data above. This doesn't consider ties and returns only the first minimum. Maybe consider asking a separate question. I would be curious as well what other answers you would get. – Valentin_Ștefan Feb 28 '17 at 10:09
  • 1
    A trick to get minimum column is sending the negative of the data.frame into max.col, like: ```colnames(.SD)[max.col(-.SD, ties.method="first")]```. – lwshang Mar 09 '18 at 14:08
17

If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It's much easier if you'd rather want the last maximum. Nevertheless, it's not that complicated and it's fast!

Here I've generated data of your dimensions (26746 * 18).

Data

set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))

data.table answer:

require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE), 
            colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]

Benchmarking:

# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE), 
            colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
#   user  system elapsed 
#  0.174   0.029   0.227 

# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
#   user  system elapsed 
#  2.322   0.036   2.602 

identical(t1, t2)
# [1] TRUE

It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.


Edit: if any of the max ids is okay, then:

DT <- data.table(value=unlist(DF, use.names=FALSE), 
            colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
Arun
  • 116,683
  • 26
  • 284
  • 387
  • I actually dont' care if it is the first or last maximum. I'm going for simplicity first, but I'm sure a data.table solution will come handy in the future, thanks! – dmvianna Jul 19 '13 at 00:48
17

One option from dplyr 1.0.0 could be:

DF %>%
 rowwise() %>%
 mutate(row_max = names(.)[which.max(c_across(everything()))])

     V1    V2    V3 row_max
  <dbl> <dbl> <dbl> <chr>  
1     2     7     9 V3     
2     8     3     6 V1     
3     1     5     4 V2     

In some contexts, it could be safer to use pmap() (requires purrr):

DF %>%
 mutate(row_max = pmap_chr(across(everything()), ~ names(c(...)[which.max(c(...))])))

Sample data:

DF <- structure(list(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 
4)), class = "data.frame", row.names = c(NA, -3L))
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
  • +1 for your `purrr` solution I had to use `mutate(row_max = unlist(purrr::pmap(across(everything()), ~ names(c(...)[which.max(c(...))]))))` to get this as a column in my dataframe. Otherwise, it returns a list – mikey Oct 19 '22 at 18:32
  • 2
    @mikey if you use map_chr(), then I returns a character vector. I updated my post :) – tmfmnk Oct 19 '22 at 19:52
11

A dplyr solution:

Idea:

  • add rowids as a column
  • reshape to long format
  • filter for max in each group

Code:

DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>% 
  rownames_to_column() %>%
  gather(column, value, -rowname) %>%
  group_by(rowname) %>% 
  filter(rank(-value) == 1) 

Result:

# A tibble: 3 x 3
# Groups:   rowname [3]
  rowname column value
  <chr>   <chr>  <dbl>
1 2       V1         8
2 3       V2         5
3 1       V3         9

This approach can be easily extended to get the top n columns. Example for n=2:

DF %>% 
  rownames_to_column() %>%
  gather(column, value, -rowname) %>%
  group_by(rowname) %>% 
  mutate(rk = rank(-value)) %>%
  filter(rk <= 2) %>% 
  arrange(rowname, rk) 

Result:

# A tibble: 6 x 4
# Groups:   rowname [3]
  rowname column value    rk
  <chr>   <chr>  <dbl> <dbl>
1 1       V3         9     1
2 1       V2         7     2
3 2       V1         8     1
4 2       V3         6     2
5 3       V2         5     1
6 3       V3         4     2
Gregor Sturm
  • 2,792
  • 1
  • 25
  • 34
3

This is a fast and simple tidyverse solution, that can easily be applied to any subset of columns in a data.frame. The version below also uses ifelse to add missing values if all columns are 0. The missing values would be useful if, e.g., someone wants to use it to recombine one-hot encoded columns. It works on the data in the question, but here's an example of a one-hot encoded data set that it also works on.

data <- data.frame(
   oh_a = c(1,0,0,1,0,0)
  ,oh_b = c(0,1,1,0,0,0)
  ,oh_c = c(0,0,0,0,1,0)
  ,d = c("l","m","n","o","p","q"))

f <- function(x){ifelse(rowSums(x)==0, NA, names(x)[max.col(x, "first")])}
data %>% 
  mutate(transformed = f(across(starts_with("oh"))))

output:

  oh_a oh_b oh_c d transformed
1    1    0    0 l        oh_a
2    0    1    0 m        oh_b
3    0    1    0 n        oh_b
4    1    0    0 o        oh_a
5    0    0    1 p        oh_c
6    0    0    0 q        <NA>
2

A simple for loop can also be handy:

> df<-data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
> df
  V1 V2 V3
1  2  7  9
2  8  3  6
3  1  5  4
> df2<-data.frame()
> for (i in 1:nrow(df)){
+   df2[i,1]<-colnames(df[which.max(df[i,])])
+ }
> df2
  V1
1 V3
2 V1
3 V2
rar
  • 894
  • 1
  • 9
  • 24
0

Here is an answer that works with data.table and is simpler. This assumes your data.table is named yourDF:

j1 <- max.col(yourDF[, .(V1, V2, V3, V4)], "first")
yourDF$newCol <- c("V1", "V2", "V3", "V4")[j1]

Replace ("V1", "V2", "V3", "V4") and (V1, V2, V3, V4) with your column names

Union find
  • 7,759
  • 13
  • 60
  • 111
0

This one is fast:

with(DF, {
  names(DF)[(V1 > V2 & V1 > V3) * 1 + (V2 > V3 & V2 > V1) * 2 + (V3 > V1 & V3 > V2)*3]
})
SmokeyShakers
  • 3,372
  • 1
  • 7
  • 18