Here's what I have:

data.frame(x=c(0,0,0,1,1,1), y=c(0,0,1,0,1,1))

  x y
1 0 0
2 0 0
3 0 1
4 1 0
5 1 1
6 1 1

Here's what I want:

data.frame(x=c(0,0,0,1,1,1), y=c(0,0,1,0,1,1), pattern=c(1,1,2,3,4,4))

  x y pattern
1 0 0       1
2 0 0       1
3 0 1       2
4 1 0       3
5 1 1       4
6 1 1       4

That is, I have a bunch of columns (not just two), and thousands of rows. I want to go through each row, figure out what the distinct combinations of x, y, z, etc. are, call each one a distinct pattern, and return that pattern for each row.

(Context: I have gene expression data for several genes over many time points. I want to see which genes oscillate similarly over time by defining patterns based on whether a gene is up- or down-regulated at each time point.)

Thanks.

Stephen Turner
  • Also, I'd be happy for someone to edit the title of this question to make it easier to find in the future by searching. Wasn't really sure how to best ask. – Stephen Turner Dec 19 '16 at 21:22
  • Also, bonus points for staying in the tidyverse. I got the x, y, z, etc time point values from spreading a long-form tidy dataset with one entry per gene per week. Maybe it would be better to start from the gathered dataset anyway(?) – Stephen Turner Dec 19 '16 at 21:24

3 Answers

You can use dplyr::group_indices():

NSE version

group_indices(df, x, y)
# [1] 1 1 2 3 4 4

SE version

group_indices_(df, .dots = names(df))
# [1] 1 1 2 3 4 4

The unfortunate side of this function is that it doesn't work inside mutate() (yet), so you have to assign the result yourself:

df$pattern <- group_indices(df, x, y)

From the linked answer, it seems that even though the non-standard evaluation version doesn't work with mutate, the standard evaluation version does:

df %>% mutate(pattern = group_indices_(df, .dots = c('x', 'y')))
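
For many columns, or for newer dplyr releases where the underscore-suffixed verbs are deprecated, something along these lines may be easier. This is only a sketch and it assumes dplyr 1.0.0 or later, since across(), everything() and cur_group_id() are not available in the version the answer above was written against. The name df_patterns is just an illustrative choice.

```r
library(dplyr)

# Example data from the question
df <- data.frame(x = c(0, 0, 0, 1, 1, 1), y = c(0, 0, 1, 0, 1, 1))

# Assumption: dplyr >= 1.0.0.
# Group by every column, then label each row with the id of its group.
df_patterns <- df %>%
  group_by(across(everything())) %>%
  mutate(pattern = cur_group_id()) %>%
  ungroup()

df_patterns$pattern
```

Grouping on across(everything()) avoids typing out the column names; replace everything() with a selection such as c(x, y) if only some columns should define the pattern.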
Psidom
  • This is great. How to generalize to (many) more than two columns x and y? – Stephen Turner Dec 20 '16 at 14:17
  • You can put the column names in a vector, `cols = c("x", "y", "z", ...etc)` and then pass it to the `.dots` parameter: `df %>% mutate(pattern = group_indices_(df, .dots = cols))` – Psidom Dec 20 '16 at 14:23

In base R we can paste together the relevant columns, convert the resulting strings to a factor, and then take the factor's integer codes:

as.numeric(as.factor(paste(df$x,'_',df$y)))

For the data above it takes about half the time of the dplyr solution (it is unclear how it will scale):

microbenchmark::microbenchmark(as.numeric(as.factor(paste(df$x,'_',df$y))), group_indices(df, x, y))
Unit: microseconds
                                        expr     min       lq     mean  median       uq     max neval cld
 as.numeric(as.factor(paste(df$x, "_", df$y))) 150.913 153.9855 162.5637 159.745 165.8890 258.817   100  a 
                     group_indices(df, x, y) 322.945 327.3610 339.4574 337.922 340.4175 567.938   100   b
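
To apply the same paste-based idea to any number of columns, a small helper like the one below might do. It is a sketch rather than a tuned implementation: pattern_id is a made-up name, and it uses match() against the unique keys instead of a factor, so patterns are numbered in order of first appearance rather than alphabetically. It also assumes the values themselves never contain the "_" separator.

```r
# Example data from the question
df <- data.frame(x = c(0, 0, 0, 1, 1, 1), y = c(0, 0, 1, 0, 1, 1))

# Hypothetical helper: one integer id per distinct combination of `cols`.
pattern_id <- function(data, cols = names(data)) {
  # Build one string key per row, e.g. "0_1" (assumes "_" never occurs in the values)
  key <- do.call(paste, c(unname(as.list(data[cols])), sep = "_"))
  # Number the keys in order of first appearance
  match(key, unique(key))
}

df$pattern <- pattern_id(df)
df$pattern
```

Passing cols = c("x", "y") restricts the pattern to a subset of the columns.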
jeremycg

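Use rleid in data.table. Note that rleid numbers consecutive runs, so this gives one id per distinct pattern only when identical rows sit next to each other, as they do in the example: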

setDT(df)[,pattern:=rleid(x,y)]
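
If the rows are not sorted, grouping with .GRP may be closer to what the question asks for. The sketch below uses a shuffled version of the example data (my own rearrangement, not data from the question) and the illustrative names dt, cols and run; it assumes the data.table package is installed.

```r
library(data.table)

# Shuffled rows: identical (x, y) combinations are no longer adjacent
dt <- data.table(x = c(1, 0, 0, 1, 0, 1), y = c(1, 0, 1, 0, 0, 1))
cols <- c("x", "y")  # columns that define a pattern

# .GRP is the group counter, numbered in order of first appearance
dt[, pattern := .GRP, by = cols]

# For comparison: rleid() starts a new id whenever consecutive rows differ
dt[, run := rleid(x, y)]

dt
```

Here pattern repeats for rows 1 and 6 and for rows 2 and 5, while run gives every row its own id because no two neighbouring rows are equal.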
Shenglin Chen