5

I have two data frames with a range of cases, one from time 1 and one from time 2. I am looking for a way to quickly identify cases where changes have occurred between time 1 and time 2, and I am kind of stuck.

Here is an example. So, I have a data frame from time 1,

df.t1 <- data.frame(id = c(1,1,1,2,2,3,3,5,5,6), ABC = LETTERS[1:10], Num = 101:110)

and it looks like this,

df.t1
   id ABC Num
1   1   A 101
2   1   B 102
3   1   C 103
4   2   D 104
5   2   E 105
6   3   F 106
7   3   G 107
8   5   H 108
9   5   I 109
10  6   J 110

time 2 rolls around

df.t2 <- df.t1

and some changes occur,

df.t2[3,3] <- 104
df.t2[2,2] <- "H"
df.t2[8,3] <- 999
df.t2[10,3] <- NA
df.t2[11,] <- c(3, "J", 107)

this is time 2,

df.t2
   id ABC  Num
1   1   A  101
2   1   H  102
3   1   C  104
4   2   D  104
5   2   E  105
6   3   F  106
7   3   G  107
8   5   H  999
9   5   I  109
10  6   J <NA>
11  3   J  107

I'm now looking for a quick way to remove cases (all rows belonging to a given id) where NO changes have occurred in any row of the case between time 1 and time 2. In this specific example, id 2 is the only case where no changes occurred between time 1 and time 2.

I am looking for an end result that looks like this,

(df <- subset(df.t2, id != 2))
   id ABC  Num
1   1   A  101
2   1   H  102
3   1   C  104
6   3   F  106
7   3   G  107
8   5   H  999
9   5   I  109
10  6   J <NA>
11  3   J  107

Any help would be appreciated.

Eric Fail
  • At `id=3` there is a change (a new entry). You don't consider them? – Arun Feb 14 '13 at 00:30
  • @Arun, my bad. I reversed the point, I'll update the question. Thanks! – Eric Fail Feb 14 '13 at 00:32
  • To add some extra clarification, I want to keep cases in `df` where changes have occurred between time 1 and time 2. Please let me know if it's still unclear! – Eric Fail Feb 14 '13 at 00:38

5 Answers

3

I think this would work. A solution using data.table:

require(data.table)
dt1 <- data.table(df.t1)
dt2 <- data.table(df.t2)
# your conversion made them characters
dt2[, `:=`(id = as.numeric(id), Num = as.numeric(Num))]
setkey(dt1, "id", "ABC")
setkey(dt2, "id", "ABC")
dt <- dt1[dt2]
dt2[id %in% dt[, !(all(Num == Num.1)) | any(is.na(c(Num, Num.1))), by=id][V1 == TRUE]$id]

#    id ABC Num
# 1:  1   A 101
# 2:  1   C 104
# 3:  1   H 102
# 4:  3   F 106
# 5:  3   G 107
# 6:  3   J 107
# 7:  5   H 999
# 8:  5   I 109
# 9:  6   J  NA

Alternatively, after obtaining `dt <- dt1[dt2]`:

dt2[id %in% dt[, ctr := Num %in% Num.1, by=1:nrow(dt)][ctr == FALSE, unique(id)]]
Arun
  • Thanks! I could never have come up with that on my own. Again, the issue is that the real data have a lot of columns, and it seems I would have to name them all in this function. Regardless, you answered my question. – Eric Fail Feb 14 '13 at 00:58
  • So, you would like "any" of the columns that are changed to be selected from `df.t2`?? If that's the case, I'll edit this post later during the day. – Arun Feb 14 '13 at 01:07
  • What I am looking for is any row that is changed between time 1 and time 2, and if a row is changed I want all rows relating to that id. I am trying to illustrate this in the last part of my question. – Eric Fail Feb 14 '13 at 01:30
3

What about using data.table and keying by all of `id`, `ABC` and `Num`?

require(data.table)
dt1 <- data.table(df.t1)
dt2 <- data.table(df.t2)
# your conversion made them characters
dt2[, `:=`(id = as.numeric(id), Num = as.numeric(Num))]
setkey(dt1, "id", "ABC", "Num")
setkey(dt2, "id", "ABC", "Num")
# then it is just

dt2[dt2[!dt1][,list(unique(id))]]


   id ABC Num
1:  1   A 101
2:  1   C 104
3:  1   H 102
4:  3   F 106
5:  3   G 107
6:  3   J 107
7:  5   H 999
8:  5   I 109
9:  6   J  NA

This uses a not-join (anti-join) between dt2 and dt1, selects the unique id values from the result, and then subsets the dt2 data appropriately.

With the caveat that `NA` values in keys can be problematic....
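For real data with many columns, the keys need not be spelled out one by one; mnel suggests `setkeyv(DT, names(DT))` in the comments below. A minimal sketch of the same not-join, assuming `id` is the first column of both tables:

```r
# key both tables on every column, then use the same not-join idea
setkeyv(dt1, names(dt1))
setkeyv(dt2, names(dt2))
# ids with at least one row in dt2 that has no exact match in dt1
dt2[id %in% dt2[!dt1][, unique(id)]]
```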

mnel
  • Yes, but that is a merge. He needs those that are not equal between the two. For example, here you get `id=2`, which shouldn't be there. – Arun Feb 14 '13 at 01:14
  • Good point @Arun, I've had a rethink and an edit. I can't make the `NA` cause an issue, but it might in certain situations. – mnel Feb 14 '13 at 01:35
  • (+1) Brilliant! I came "close" with `dt2[!dt1]` but didn't think of this!! I was sure there must be an equivalent of `setdiff` (for vectors) in data.table. – Arun Feb 14 '13 at 01:55
  • @mnel, does this entail that I would need to set all 34 column names as `setkey` in my real data? – Eric Fail Feb 14 '13 at 01:56
  • Yes, you could do `setkeyv(DT, names(DT))` to avoid having to spell them all out (and ensuring that id is column number 1) – mnel Feb 14 '13 at 01:58
2

Try this:

df.t1$interact <- with(df.t1, interaction(id, ABC, Num))
df.t2$interact <- with(df.t2, interaction(id, ABC, Num))

change.ids <- unique(df.t2$id[!df.t2$interact %in% df.t1$interact])
new.df <- df.t2[df.t2$id %in% change.ids,]
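If building the `interaction` factor proves slow on wide data, a `paste`-based key (as Aaron suggests in the comments below) is one possible variant. This is a sketch, with `"\r"` as an assumed separator chosen because it is unlikely to occur in the data:

```r
# paste all columns of each row into a single character key
key1 <- do.call(paste, c(df.t1, sep = "\r"))
key2 <- do.call(paste, c(df.t2, sep = "\r"))
# ids whose rows in df.t2 have no exact counterpart in df.t1
change.ids <- unique(df.t2$id[!key2 %in% key1])
new.df <- df.t2[df.t2$id %in% change.ids, ]
```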
alexwhan
  • Thank you for responding to my question. It seems to work after your update; the only problem is that I have to define `id`, `ABC`, and `Num` in `interaction()`, but I guess I can use `names()`. – Eric Fail Feb 14 '13 at 00:46
  • What do you want to happen? How is your data structured? – alexwhan Feb 14 '13 at 00:47
  • You do solve the problem; the issue is that in the real data I have 34 columns, and some of them have some text in them, so pasting it all together into `interact` might not be the most optimal solution. Can you think of a way where I can compare it row by row? – Eric Fail Feb 14 '13 at 00:52
  • If it's manually naming the columns you're worried about, this might work: `df.t1$interact <-with(df.t1, interaction((df.t1[,1:ncol(df.t1)])))` but it's gonna be ugly – alexwhan Feb 14 '13 at 01:20
  • I like the `interaction((df.t1[,1:ncol(df.t1)]))`. It might be _ugly_, but it makes it more flexible. – Eric Fail Feb 14 '13 at 01:26
  • I'm trying this solution, but it's killing my computer. I believe it's the combining of everything into one variable. I think I need a solution where I do it row by row. – Eric Fail Feb 14 '13 at 01:43
  • OK, have a go at the looping idea I posted below – alexwhan Feb 14 '13 at 01:46
  • Is it faster if you use `paste` instead of `interaction`? `interaction` is making many more factor levels than you need. – Aaron left Stack Overflow Feb 14 '13 at 03:28
  • Also, perhaps generating a unique number would be faster? See my attempt here: http://stackoverflow.com/a/14867349/210673 – Aaron left Stack Overflow Feb 14 '13 at 03:32
2

I really liked @alexwhan's answer here https://stackoverflow.com/a/14865931/210673, but it sounded like the combining was slow because of many columns and long character strings. I wonder if numerically getting a unique number for each combination would be faster.

# get a matrix of unique integers for each column (stacking the two data frames)
ms <- do.call(cbind, lapply(seq_len(ncol(df.t1)), function(ni) {
  xi <- c(as.character(df.t1[[ni]]), as.character(df.t2[[ni]]))
  match(xi, unique(xi))
}))
# convert to base max(ms) to get a single unique identifying number
us <- as.vector(ms %*% max(ms)^c(0:(ncol(ms)-1)))
u2 <- us[(nrow(df.t1)+1):length(us)]
u1 <- us[1:nrow(df.t1)]
# now get changed values and proceed as in alexwhan's answer
ch <- unique(df.t2$id[! u2 %in% u1])
df.t2[df.t2$id %in% ch,]

Here's a slightly different way to get the us variable. I expect it's a little slower, but it takes more care to keep everything as integers instead of floating-point numerics, so uniqueness should be guaranteed, and I believe any integer overflow will cause a warning. (I also subtract one from ms; the result is still unique and everything is just a little smaller.)

base <- as.integer(max(ms)^(0:(ncol(ms)-1)))
# multiply each column of (ms - 1) by its base weight, then sum across rows
us <- apply(t(t(ms - 1L) * base), 1, sum)
Aaron left Stack Overflow
  • This solution is the first that actually works with my data. Very grateful that you found this solution. I don't fully understand it yet, but I appreciate that it's there. Thanks! – Eric Fail Feb 14 '13 at 05:18
  • Especially with R, a good solution depends on the size of your data set, so while coming up with a small reproducible example is always recommended, it can also really help to state the size of the data set; in this case, not only how many rows and columns, but also how many ids and how many unique things are in each column. (That last would perhaps affect my solution; I worry a little bit about integer overflow.) – Aaron left Stack Overflow Feb 14 '13 at 14:02
  • Thank you for following up with some really sound advice. I will add that information to my post later today (and in future questions). Do you have any good sources where I can learn more about _integer overflow_? – Eric Fail Feb 14 '13 at 14:39
  • http://stackoverflow.com/questions/8804779/what-is-integer-overflow-in-r-and-how-can-it-happen – Aaron left Stack Overflow Feb 14 '13 at 14:45
  • PS. Did @alexwhan's answer work better with `paste` instead of `interact`? If that works this complication is unnecessary... – Aaron left Stack Overflow Feb 14 '13 at 15:52
  • The paste doesn't really work in my case. Your tweak of `us` did not work for me, so I am using your solution as is, and so far it works great, but I think I should keep looking for a better solution, cf. your warning about _integer overflow_. – Eric Fail Feb 14 '13 at 20:07
  • Well, I don't think anyone here can help with a better solution unless we know more about your specific requirements. I'd suggest creating a data set with random data (using `set.seed`) that is similar to yours, and starting a new question, referencing this one and asking for ideas on what to do given the larger data set. – Aaron left Stack Overflow Feb 14 '13 at 20:23
  • That is excellent advice. I'll do that ASAP. Thank you for all your help and advice. – Eric Fail Feb 14 '13 at 20:41
1

OK, here's an alternative with some looping. I haven't tried it on a more complex example, but have a look:

no.change <- vector()
# identifies rows of df.t2 that also appear in df.t1 (i.e. unchanged)
for (i in 1:nrow(df.t2)) {
  for (k in 1:nrow(df.t1)) {
    if (all(df.t2[i, ] == df.t1[k, ], na.rm = TRUE)) # na.rm gets rid of the error
      no.change <- c(no.change, i)
  }
}
# gets ids from all the other rows
id.change <- df.t2$id[-no.change]
df <- df.t2[df.t2$id %in% id.change, ]
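The growing-vector pattern above (criticized in the comments below) can be avoided. A sketch of the same row-by-row comparison using `sapply`, which also guards against the edge case where every row has changed and `no.change` would be empty:

```r
# TRUE for each row of df.t2 that matches some row of df.t1
unchanged <- sapply(1:nrow(df.t2), function(i)
  any(sapply(1:nrow(df.t1), function(k)
    all(df.t2[i, ] == df.t1[k, ], na.rm = TRUE))))
id.change <- df.t2$id[!unchanged]
df <- df.t2[df.t2$id %in% id.change, ]
```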
alexwhan
  • Thanks for giving it another go! – Eric Fail Feb 14 '13 at 01:49
  • I get this error: `Error in if (all(df.t2[i, ] == df.t1[k, ])) no.change <- c(no.change, : missing value where TRUE/FALSE needed` – Eric Fail Feb 14 '13 at 01:50
  • This falls into the [2nd circle of hell](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf). Never grow objects in that way! – mnel Feb 14 '13 at 01:53
  • I know I know (you should see some of the truuuuly ugly stuff) - all too often "it just works for right now" makes the grade – alexwhan Feb 14 '13 at 01:54
  • @alexwhan, but … well. I don't know how I feel about solving something with code that returns an error. Regardless, you are right. – Eric Fail Feb 14 '13 at 01:58