How to label by combination of available data in each row in R

Question

So I have three columns of Boolean values. I want to produce a fourth column that contains labels for the different arrangements of data I could have. Paint-by-numbers, if you will.

Example:

A  |  B  |  C  |  newCol
------------------------
0  |  0  |  0  |   0
1  |  0  |  0  |   1
0  |  1  |  0  |   2
0  |  0  |  1  |   3
1  |  1  |  0  |   4
0  |  1  |  1  |   5
1  |  0  |  1  |   6
1  |  1  |  1  |   7

So, based on the arrangement that happens among A, B, and C to have a corresponding label.

Preferably using tidyverse approach.

No, as it is a distinct combination. 3 refers to when only C is true. 4 refers to when A and B are true. — user26481, Feb 25 '19 at 22:03
This must be a duplicate. One method in base R is to use `factor`, `Reduce`, and `paste`. Something like `dat$newCol <- as.integer(factor(Reduce(paste, dat)))` will be pretty close. — lmo, Feb 25 '19 at 22:04
@M-M thank you, that works great! If I add a dummy section at the top for my desired grouping (and remove it after), I should get the exact labeling I desire, correct? — user26481, Feb 25 '19 at 22:26
@user26481 correct. I also added an answer to show the output. Let me elaborate there. — M--, Feb 25 '19 at 22:27

Rui Barradas · Answer 1 · 2019-02-26T09:57:46.373

2

You can use the knowledge that factors are internally encoded as integers to get the result you want.

First paste together each row's values.

lvls <- apply(df1[-4], 1, paste, collapse = "")

Then coerce to class "factor" and from there to class "integer".

f <- factor(lvls, levels = unique(lvls))
as.integer(f) - 1
#[1] 0 1 2 3 4 5 6

identical(df1$newCol, as.integer(f) - 1)
#[1] TRUE

Data.

df1 <- read.table(text = "
A  |  B  |  C  |  newCol
0  |  0  |  0  |   0
1  |  0  |  0  |   1
0  |  1  |  0  |   2
0  |  0  |  1  |   3
1  |  1  |  0  |   4
0  |  1  |  1  |   5
1  |  1  |  1  |   6                  
", header = TRUE, sep = "|")

edited Feb 26 '19 at 09:57

answered Feb 25 '19 at 22:09

Rui Barradas

70,273
8
34
66

try your solution on this data: `structure(list(A = c(0, 1, 0, 0, 0, 1, 0, 1), B = c(0, 0, 0, 1, 0, 1, 1, 1), C = c(0, 0, 0, 0, 1, 0, 1, 1), newCol = c(0L, 1L, 0L, 2L, 3L, 4L, 5L, 6L)), class = "data.frame", row.names = c(NA, -8L))` I guess there is a bug! :/ – M-- Feb 25 '19 at 23:18
1

`f <- factor(lvls, levels = unique(lvls))` this works! – M-- Feb 25 '19 at 23:48

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

Using data.table package we can preserve the original sorting (recommended):

library(data.table)

setDT(df1)[,new_col:=.GRP-1, by = c("A", "B","C")]

#if you want the column as factor (one-liner, no need for previous line)
setDT(df1)[,new_col:=.GRP-1, by = c("A", "B","C")][,new_col:=as.factor(new_col)]

Using dplyr we can do something like below:

_{(Rui's solution implemented in dplyr with minimal modification to consider possibility of duplicate rows):}

This also preserves the sorting;

df1 %>% mutate(mtemp=paste0(A,B,C)) %>%  
        mutate(new_col = as.integer(factor(mtemp, levels = unique(.$mtemp)))-1) %>% 
        select(-mtemp)

We can use a dummy variable to label the data:

df1 %>% mutate(mtemp = paste0(A,B,C)) %>% 
        group_by(mtemp) %>% arrange(mtemp) %>% ungroup() %>%
        mutate(new_col = c(0,cumsum(lead(mtemp)[-n()] != lag(mtemp)[-1]))) %>% select(-mtemp)

# # A tibble: 8 x 5
#       A     B     C      newCol  new_col
#       <dbl> <dbl> <dbl>  <int>   <dbl>
# 1     0     0     0      0       0
# 2     0     0     0      0       0
# 3     0     0     1      3       1
# 4     0     1     0      2       2
# 5     0     1     1      5       3
# 6     1     0     0      1       4
# 7     1     1     0      4       5
# 8     1     1     1      6       6

or in reference to this thread:

df1 %>% 
  mutate(group_id = group_indices(., paste0(A,B,C)))

Explanation on dplyr solutions:

First solution creates a dummy variable by pasting all three desired variables together; in the next step, each group of that dummy var gets a unique id (compare newCol to new_col). Basically if mtemp changes between any two rows, we get True (its numeric value is 1) as the answer of our comparison (lead(mtemp)...) and then cumsum adds it to the previous id generated which eventually results in different ids for different mtemp (combination of A, B, and C). This solution relies on arrangement of the dummy variable and therefore does not address the sorting requirements.

For the other solution, just read up on ?group_indices.

Data:

df1 <- structure(list(A = c(0, 1, 0, 0, 0, 1, 0, 1), B = c(0, 0, 0, 
1, 0, 1, 1, 1), C = c(0, 0, 0, 0, 1, 0, 1, 1), newCol = c(0L, 
1L, 0L, 2L, 3L, 4L, 5L, 6L)), class = "data.frame", row.names = c(NA, 
-8L))

Thanks. I'm trying to work through it. Does this change the order at all? It's incredibly important to me that it doesn't — user26481, Feb 25 '19 at 22:49
@user26481 first solution relies on putting each group of data into subsequent rows. You can use the second solution if order matters; however, it will not give you ordered indices for each group. To modify first solution so it won't change the order, you can also add another dummy variable with the original row numbers (default sorting) and sort by that variable when you are done with indexing. Then you can get rid of all dummies. — M--, Feb 25 '19 at 23:02
This doesn't work for me: df %>% rownames_to_column() %>% mutate(mtemp = paste0(A, B, C)) %>% group_by(mtemp) %>% arrange(mtemp) %>% ungroup() %>% mutate(group = c(0,cumsum(lead(mtemp)[-n()] != lag(mtemp)[-1]))) %>% arrange(rowname) %>% select(-mtemp, -rowname) — user26481, Feb 25 '19 at 23:09
It isn't ordered properly. I'm adding a dummy encoding at the beginning (Rui Barradas's data) in the hopes of enforcing the order of the labeling (but I see that's not possible under your approach...I was going to jankily fix it later). ------ then I'm looking for it at the end. It's also a way of enforcing all 7 arrangements to be present if they aren't actually in the data. PS: Thank you so much for the help so far! — user26481, Feb 25 '19 at 23:16
@user26481 On Rui's data we don't have much problem actually. I see the problem tho. Thinking about more general cases (what if you had two rows with the same `A, B, C`. Then, I have to give this more thinking (I wanna avoid giving out hacky solutions for special cases). Lemme spend some more time on this. — M--, Feb 25 '19 at 23:22

score 0 · Accepted Answer · answered Feb 26 '19 at 01:40

If it doesn't matter what your new column values are for each unique row, you can use this one liner:

df$newCol <- as.numeric(as.factor(paste(df$a, df$b, df$c, sep = "")))

If you need your new column to be precisely 0-7 for each unique row as specified you can use a two-liner and unique()

df$newCol <- as.factor(paste(df$a, df$b, df$c, sep = ""))
df$newCol <- as.numeric(factor(df$newCol, levels = unique(df$newCol)))-1

If you need this column as a factor then just do this to the result:

df$newCol <- as.factor(df$newCol)

How to label by combination of available data in each row in R

3 Answers3

Data: