5

I would like to obtain (in an new column in the data.table) the column name of the column that contains the maximum value in only a few columns in a data.frame.

Here is an example data.frame

# creating the vectors then the data frame ------
id = c("a", "b", "c", "d")
 ignore = c(1000,1000, 1000, 1000) 
 s1 = c(0,0,0,100)
s2 = c(100,0,0,0)
s3 = c(0,0,50,0)
s4 = c(50,0,50,0)
 df1 <- data.frame(id,ignore,s1,s2,s3,s4)  

(1) now I want to find the column name of the maximum number in each row, from the columns s1-s4. (i.e. ignore the column called "ignore")

(2) If there is a tie for the maximum, I would like the last (e.g. s4) column name returned.

(3) as an extra favour - if all are 0, I would ideally like NA returned

here is my best attempt so far

df2 <- cbind(df1,do.call(rbind,apply(df1,1,function(x) {data.frame(max.col.name=names(df1)[which.max(x)],stringsAsFactors=FALSE)})))

this returns ignore in each case, and (except for row b) works if I remove this column, and reorder the s1-s4 columns as s4-s1.

How would you approach this?

Many thanks indeed.

tjebo
  • 21,977
  • 7
  • 58
  • 94
threeisles
  • 301
  • 2
  • 8

2 Answers2

1

We use grep to create a column index for columns that start with 's' followed by numbers ('i1'). To get the row index of the subset dataset ('df1[i1]') that has the maximum value, we can use max.col with the option ties.method='last'. To convert the rows that have only 0 values to NA, we get the rowSums, check if that is 0 (==0) and convert those to NA (NA^) and multiply with max.col output. This can be used to extract the column names of subset dataset.

i1 <- grep('^s\\d+', names(df1))
names(df1)[i1][max.col(df1[i1], 'last')*NA^(rowSums(df1[i1])==0)]
#[1] "s2" NA   "s4" "s1"
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you akrun. - If my understanding is correct the first line defines the list of columns to consider, and the second line performs the query. Wonderful, thanks again. – threeisles Sep 10 '15 at 14:08
  • thanks again - how would you modify that if there were NAs in some of the S1-S4 columns? I've tried preceeding it with na.omit or following it with na.rm = TRUE but can't get it to work. Any ideas? Many thanks – threeisles Sep 10 '15 at 14:45
  • @threeisles I guess you replace the `NA` with 0 and then try. i.e. `df1[is.na(df1)] <- 0` (not tested) – akrun Sep 10 '15 at 14:47
  • 1
    thanks I tried this: df[i1][is.na(df1[i1])] <- 0 and if worked perfectly – threeisles Sep 10 '15 at 14:53
1
library(dplyr)
library(tidyr)

df1 = data_frame(
  id = c("a", "b", "c", "d")
  ignore = c(1000,1000, 1000, 1000) 
  s1 = c(0,0,0,100)
  s2 = c(100,0,0,0)
  s3 = c(0,0,50,0)
  s4 = c(50,0,50,0))

result = 
  df1 %>%
  gather(variable, value, -id, -ignore) %>%
  group_by(id) %>%
  slice(value %>%
          {. == max(.)} %>%
          which %>%
          last) %>%
  ungroup %>%
  mutate(variable_fix = ifelse(value == 0,
                               NA,
                               variable))
bramtayl
  • 4,004
  • 2
  • 11
  • 18