13

I am trying to find a way to determine when a set of columns changes value in a data.frame. Let me get straight to the point, please consider the following example:

x<-data.frame(cnt=1:10, code=rep('ELEMENT 1',10), val0=rep(5,10), val1=rep(6,10),val2=rep(3,10))
x[4,]$val0=6
  • The cnt column is a unique ID (could be a date, or time column, for simplicity it's an int here)
  • The code column is like an code for the set of rows (imagine several such groups but with different codes). The code and cnt are the keys in my data.table.
  • The val0,val1,val2 columns are something like scores.

The data.frame above should be read as: The scores for 'ELEMENT 1' started as 5,6,3, remained as is until the 4 iteration when they changed to 6,6,3, and then changed back to 5,6,3.

My question, is there a way to get the 1st, 4th, and 5th row of the data.frame? Is there a way to detect when the columns change? (There are 12 columns btw)

I tried using the duplicated of data.table (which worked perfectly in the majority of the cases) but in this case it will remove all duplicates and leave rows 1 and 4 only (removing the 5th).

Do you have any suggestions? I would rather not use a for loop as there are approx. 2M lines.

tshepang
  • 12,111
  • 21
  • 91
  • 136
Nikos
  • 3,267
  • 1
  • 25
  • 32

2 Answers2

17

In data.table version 1.8.10 (stable version in CRAN), there's a(n) (unexported) function called duplist that does exactly this. And it's also written in C and is therefore terribly fast.

require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5]) 
# [1] 1 4 5

If you're using the development version of data.table (1.8.11), then there's a more efficient version (in terms of memory) renamed as uniqlist, that does exactly the same job. Probably this should be exported for next release. Seems to have come up on SO more than once. Let's see.

require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
Arun
  • 116,683
  • 26
  • 284
  • 387
  • You weren't joking; that is really fast. – nograpes Jan 21 '14 at 22:51
  • 2
    He Arun, amazing. I use data.table almost exclusively to be honest. One additional question if you don't mind. When I run data.frame::duplist, I get an error [...] not type list. The argument should not be a data.table? Just to clarify, this is EXACTLY the functionality I am looking for. Will accept this answer once I have your comment about my question. Thanks – Nikos Jan 22 '14 at 09:13
  • 3
    Nikos, `uniqlist`/`duplist` is a function in `data.table` package. And because it's not exported, you'll have to use the `data.table:::` (where the `data.table` here refers to the package) operator to access the function. And it'll operate basically on data.frames/data.tables (as it expects list input). Hope this clears things up? – Arun Jan 22 '14 at 10:13
  • Sorry, I typed data.FRAME::duplist. I mean data.table::duplist. You've been extremely helpful. Thank you – Nikos Jan 22 '14 at 10:22
  • 1
    Sure. It basically gives you the index at every unique combination. More like `rle`, but works with lists, but returns indices instead of values. – Arun Jan 22 '14 at 10:26
  • @Arun Your solution works perfectly, which makes me wonder how one could learn about these functionality? I googled data.table duplist and it's not even in the docs. Had I not seen your answer, it's impossible to know! – Heisenberg Oct 19 '14 at 19:18
  • @Heisenberg, that's because the function is not exported for use (yet). Maybe you could file an issue on our github page? We'll try to export it as soon as we can. Thank you. – Arun Oct 19 '14 at 19:25
  • 2
    I just realized that you're a `data.table` developer -- no wonder! [Issue filed](https://github.com/arunsrinivasan/datatable/issues/3). – Heisenberg Oct 19 '14 at 19:31
  • @Arun, how stable is this likely to be? I'd like to put something using `data.table:::uniqlist` into production code, but I understand that this is frowned upon as I am of course not also a `data.table` developer and thus that method could change. (See here for frowns: https://stat.ethz.ch/pipermail/r-devel/2013-August/thread.html#67180) – ijoseph Aug 02 '16 at 16:39
  • @Arun - FYI I used `data.table:::uniqlist()` for a fast solution in this answer https://stackoverflow.com/a/76585097/10276092 – M.Viking Jul 03 '23 at 16:08
3

Totally unreadable, but:

c(1,which(rowSums(sapply(x[,grep('val',names(x))],diff))!=0)+1)
# [1] 1 4 5

Basically, run diff on each row, to find all the changes. If a change occurs in any column, then a change has occurred in the row.

Also, without the sapply:

c(1,which(rowSums(diff(as.matrix(x[,grep('val',names(x))])))!=0)+1)
nograpes
  • 18,623
  • 1
  • 44
  • 67
  • 1
    +1, Note to OP you'll then need to subset the data.frame with the above expressions (presumably obvious, but just in case...). – BrodieG Jan 21 '14 at 19:05
  • nograpes, thank you. Works perfectly. It is little bit slow though when run on 2M lines. I am afraid that I am leaning on the answer from Arun (for a number of reasons). Thank you for your time. – Nikos Jan 22 '14 at 09:32