R: define distinct pattern from values of multiple variables

Question

Here's what I have:

data.frame(x=c(0,0,0,1,1,1), y=c(0,0,1,0,1,1))

  x y
1 0 0
2 0 0
3 0 1
4 1 0
5 1 1
6 1 1

Here's what I want:

data.frame(x=c(0,0,0,1,1,1), y=c(0,0,1,0,1,1), pattern=c(1,1,2,3,4,4))

  x y pattern
1 0 0       1
2 0 0       1
3 0 1       2
4 1 0       3
5 1 1       4
6 1 1       4

That is, I have a bunch of columns (not just two), and thousands of rows. I want to go through each row, figure out what the distinct combinations of x, y, z, etc. are, call each one a distinct pattern, and return that pattern for each row.

(Context: I have gene expression data for several genes over many time points. I want to try to see which genes oscillate similarly over time by defining patterns based on whether something's up or down-regulated at any particular time point).

Thanks.

Also, I'd be happy for someone to edit the title of this question to make it easier to find in the future by searching. Wasn't really sure how to best ask. — Stephen Turner, Dec 19 '16 at 21:22
Also, bonus points for staying in the tidyverse. I got the x, y, z, etc time point values from spreading a long-form tidy dataset with one entry per gene per week. Maybe it would be better to start from the gathered dataset anyway(?) — Stephen Turner, Dec 19 '16 at 21:24

Psidom · Accepted Answer · 2016-12-19T22:03:01.380

You can use dplyr::group_indices():

NSE version

group_indices(df, x, y)
# [1] 1 1 2 3 4 4

SE version

group_indices_(df, .dots = names(df))
# [1] 1 1 2 3 4 4

The drawback of this function is that it doesn't work inside mutate() (yet), so you have to assign the result directly:

df$pattern <- group_indices(df, x, y)

From the linked answer, it seems that even though the non-standard evaluation version doesn't work with mutate, the standard evaluation version does:

df %>% mutate(pattern = group_indices_(df, .dots = c('x', 'y')))
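For readers on a newer dplyr, here is a minimal sketch of the same idea, assuming dplyr 1.0.0 or later. To my understanding, the underscore verbs such as group_indices_() and the group_indices(df, x, y) form are deprecated there in favour of grouping first and calling cur_group_id(). The name `result` is only illustrative.

```r
library(dplyr)

df <- data.frame(x = c(0, 0, 0, 1, 1, 1), y = c(0, 0, 1, 0, 1, 1))

# cur_group_id() gives the integer index of the current group,
# so it can be called inside mutate() on a grouped data frame.
result <- df %>%
  group_by(x, y) %>%
  mutate(pattern = cur_group_id()) %>%
  ungroup()

result$pattern
# [1] 1 1 2 3 4 4
```

Groups are numbered in the sorted order of the grouping keys, which is why this matches the output of group_indices() above.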

edited Dec 19 '16 at 22:03

answered Dec 19 '16 at 21:24

This is great. How to generalize to (many) more than two columns x and y? – Stephen Turner Dec 20 '16 at 14:17
You can put the column names in a vector, `cols = c("x", "y", "z", ...etc)`, and then pass it to the `.dots` parameter: `df %>% mutate(pattern = group_indices_(df, .dots = cols))` – Psidom Dec 20 '16 at 14:23
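A sketch of the many-column case in current dplyr, assuming version 1.0.0 or later, where across() and all_of() accept a character vector of column names. The third column `z` and the name `result` are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  x = c(0, 0, 0, 1, 1, 1),
  y = c(0, 0, 1, 0, 1, 1),
  z = c(1, 1, 1, 1, 0, 1)  # hypothetical extra column, not in the question
)

# Any number of column names can go here; names(df) would use all of them.
cols <- c("x", "y", "z")

# Group by every column named in `cols`, then label each group.
result <- df %>%
  group_by(across(all_of(cols))) %>%
  mutate(pattern = cur_group_id()) %>%
  ungroup()

result$pattern
# [1] 1 1 2 3 4 5
```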

score 5 · Answer 2 · answered Dec 19 '16 at 21:36

In base R we can paste the relevant columns together into one string per row, convert that to a factor, and then take the factor's integer codes:

as.numeric(as.factor(paste(df$x,'_',df$y)))

For the data above it takes about half the time of the dplyr solution (unclear how it will scale):

microbenchmark::microbenchmark(as.numeric(as.factor(paste(df$x,'_',df$y))), group_indices(df, x, y))
Unit: microseconds
                                          expr     min       lq     mean  median       uq     max neval cld
 as.numeric(as.factor(paste(df$x, "_", df$y))) 150.913 153.9855 162.5637 159.745 165.8890 258.817   100  a 
                       group_indices(df, x, y) 322.945 327.3610 339.4574 337.922 340.4175 567.938   100   b

Need to generalize this to >2 columns – sirallen Dec 19 '16 at 21:36
@sirallen ; `as.numeric(factor(do.call(paste, d)))` – user20650 Dec 19 '16 at 21:38
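The do.call(paste, ...) idea from the comment can be sketched for an arbitrary set of columns as follows. This is only an illustration: the column `z` and the names `cols`, `key` and `pattern` are made up, and the explicit separator is there so that values such as ("1", "11") and ("11", "1") cannot collapse into the same string.

```r
df <- data.frame(
  x = c(0, 0, 0, 1, 1, 1),
  y = c(0, 0, 1, 0, 1, 1),
  z = c(1, 1, 1, 1, 0, 1)  # hypothetical extra column, not in the question
)

# Columns that define a pattern; here all of them.
cols <- names(df)

# Build one string per row, e.g. "0_0_1", by pasting the chosen columns.
key <- do.call(paste, c(df[cols], sep = "_"))

# Factor levels are the sorted unique keys; the integer codes are the pattern ids.
pattern <- as.integer(factor(key))

pattern
# [1] 1 1 2 3 4 5
```

To number patterns in order of first appearance instead of sorted order, factor(key, levels = unique(key)) should work.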

score 1 · Answer 3 · answered Dec 19 '16 at 22:07

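Use rleid() from data.table. Note that rleid() numbers consecutive runs of identical values, so it yields one id per distinct combination only when rows sharing a combination are adjacent, as they are in the example data (or after sorting by those columns).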

setDT(df)[,pattern:=rleid(x,y)]
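If the rows are not sorted, grouping with `by` and using data.table's `.GRP` symbol may be closer to what the question asks for. A small sketch with made-up, deliberately unsorted data (the names `dt` and `run_id` are illustrative) shows the difference:

```r
library(data.table)

# The (0, 0) and (1, 1) combinations each appear in two separate runs.
dt <- data.table(x = c(0, 1, 0, 1), y = c(0, 1, 0, 1))

# rleid() starts a new id every time the (x, y) pair changes from the previous row.
dt[, run_id := rleid(x, y)]

# .GRP is the group counter, so identical combinations share one id.
dt[, pattern := .GRP, by = .(x, y)]

dt$run_id
# [1] 1 2 3 4
dt$pattern
# [1] 1 2 1 2
```

With `by`, groups are numbered in order of first appearance rather than in sorted order.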

Shenglin Chen
