3

I have a dataframe like this

df <- expand.grid(0:1, 0:1, 0:1, 0:1)
df
   Var1 Var2 Var3 Var4
1     0    0    0    0
2     1    0    0    0
3     0    1    0    0
4     1    1    0    0
5     0    0    1    0
6     1    0    1    0
7     0    1    1    0
8     1    1    1    0
9     0    0    0    1
10    1    0    0    1
11    0    1    0    1
12    1    1    0    1
13    0    0    1    1
14    1    0    1    1
15    0    1    1    1
16    1    1    1    1

I am trying to create a Rank column based on some conditions on Var1, Var2, Var3, Var4

The order of ranking precedence is determined by the variables

  • Column Var1 has the highest preference and if it has a value of 1, then it is given a higher rank
  • Column Var2 has a higher preference over Var3, Var4
  • Columns Var1 and Var2 have higher preference over Var3, Var4
  • There is NO preference given to Var3 and Var4 and are only used as counts for ranking

If any rows have the same counts for Var3, Var4, then they are ranked with the same number.

My desired output is

   Var1 Var2 Var3 Var4 rank
1     0    0    0    0   12
2     1    0    0    0    6
3     0    1    0    0    9
4     1    1    0    0    3
5     0    0    1    0   11
6     1    0    1    0    5
7     0    1    1    0    8
8     1    1    1    0    2
9     0    0    0    1   11
10    1    0    0    1    5
11    0    1    0    1    8
12    1    1    0    1    2
13    0    0    1    1   10
14    1    0    1    1    4
15    0    1    1    1    7
16    1    1    1    1    1

I am trying to this manually but it is not very efficient

df %>%
  mutate(rank = case_when(
    Var1 == 1 & Var2 == 1 & Var3 == 1 & Var4 == 1~ "1",
    Var1 == 1 & Var2 == 1 & Var3 == 1 & Var4 == 0~ "2",
    TRUE                                     ~ ""
  ))

I want to apply the logic to a larger dataset. Is there an efficient way to do this? Can someone point me in the right direction?

Henrik
  • 65,555
  • 14
  • 143
  • 159
Sharath
  • 2,225
  • 3
  • 24
  • 37

2 Answers2

2

frank and frankv in data.table "accepts vectors, lists, data.frames or data.tables as input", which can be useful here.

First, frankv. It has a cols argument where columns to be ranked can be specified in a character vector - convenient if there are many column names which need to be generated programmatically. It also has a neat order argument.

library(data.table)
setDT(df)
df[ , Var34 := Var3 + Var4]
cols = c("Var1", "Var2", "Var34")
df[ , r := frankv(.SD, cols, order = -1L, ties.method = "dense")]
df[ , Var34 := NULL]

#     Var1 Var2 Var3 Var4  r
#  1:    0    0    0    0 12
#  2:    1    0    0    0  6
#  3:    0    1    0    0  9
#  4:    1    1    0    0  3
#  5:    0    0    1    0 11
#  6:    1    0    1    0  5
#  7:    0    1    1    0  8
#  8:    1    1    1    0  2
#  9:    0    0    0    1 11
# 10:    1    0    0    1  5
# 11:    0    1    0    1  8
# 12:    1    1    0    1  2
# 13:    0    0    1    1 10
# 14:    1    0    1    1  4
# 15:    0    1    1    1  7
# 16:    1    1    1    1  1

frank is handy for interactive use:

df[ , r := frank(.SD, -Var1, -Var2, -Var34, ties.method = "dense")]

Related answers: How to emulate SQLs rank functions in R?; Rank based on several variables

Henrik
  • 65,555
  • 14
  • 143
  • 159
  • 1
    Thanks for this solution :-) data.table continues to amaze me with its simplicity and execution. I am now applying your logic to my large data frame. I will report back – Sharath Jul 09 '20 at 18:53
  • 1
    It works perfectly on my bigger dataset. Thank you so much :-) – Sharath Jul 09 '20 at 19:22
0

I propose this which is a small trick :

df <- expand.grid(0:1, 0:1, 0:1, 0:1)

df[,2] <- df[,2] * 10
df[,3] <- df[,3] * 100
df[,4] <- df[,4] * 100

rank <- rowSums(df)
as.numeric(as.factor(rank))
Rémi Coulaud
  • 1,684
  • 1
  • 8
  • 19