4

This seems like it should be really simple. I have two data frames of unequal length in R. One is simply a random subset of the larger data set, so both contain exactly the same data and share an identical UniqueID. What I would like to do is put an indicator, say a 0 or 1, in the larger data set that says whether each row is in the smaller data set.

I can use which(long$UniqID %in% short$UniqID), but I can't figure out how to match this indicator back to the long data set.

Kerry
  • Please make your post reproducible by having a look at [**How to make a great reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for us to help you. Thank you. – Arun Apr 23 '13 at 08:52
  • @Arun I didn't think it was that complicated a question, so I didn't think I would need to add reproducible data. Since I have no code to do this, I'm not sure what would be necessary to make it reproducible. – Kerry Apr 23 '13 at 08:55
  • The question isn't complicated. But I feel it's your responsibility to provide data for others to work on. Imagine answering many not-so-complicated questions, with each person who tries to answer having to create data themselves for every question. It's just easier if the OP provides the data. – Arun Apr 23 '13 at 09:11
  • @Arun +1, otherwise we have to assume a lot of things! – Nishanth Apr 23 '13 at 09:12
  • @Arun I was attempting to add sample data to my question when others had already done so. I will always provide sample data from now on, even if I have no code to help. :) – Kerry Apr 23 '13 at 10:01

5 Answers

7

Made some sample data.

long<-data.frame(UniqID=sample(letters[1:20],20))
short<-data.frame(UniqID=sample(letters[1:20],10))

You can use %in% without which() to get a logical vector of TRUE and FALSE values, and then convert them to 1 and 0 with as.numeric().

long$sh<-as.numeric(long$UniqID %in% short$UniqID)
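For comparison, here is a minimal sketch of how the which() result from the question could be mapped back to the long data frame. It is not part of the original answer; the column name sh_which and the set.seed() call are assumptions added only for illustration and repeatability.

```r
set.seed(42)  # assumption: seed added only so the sketch is repeatable
long  <- data.frame(UniqID = sample(letters[1:20], 20))
short <- data.frame(UniqID = sample(letters[1:20], 10))

# which() returns the row positions in long whose ID also occurs in short
idx <- which(long$UniqID %in% short$UniqID)

# Start with 0 everywhere, then set the matching positions to 1
long$sh_which <- 0
long$sh_which[idx] <- 1

# The one-step version from the answer, for comparison
long$sh <- as.numeric(long$UniqID %in% short$UniqID)
```

Both columns should hold the same values; the which() step is simply redundant, because the logical vector already lines up with the rows of long.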
Didzis Elferts
  • Perfect, and thank you for generating sample data. I apologize for not doing something like this in my question. – Kerry Apr 23 '13 at 09:56
7

I'll use @AnandaMahto's data to illustrate another way using duplicated, which works whether or not you have a unique ID column.

Case 1: Has unique id column

set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1)[, "ID", 
            drop=FALSE])[-seq_len(nrow(df2))])

Case 2: Has no unique id column

set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1))[-seq_len(nrow(df2))])
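One caveat worth noting for Case 2: duplicated() on the stacked rows would presumably also flag a row of df1 that merely repeats an earlier row of df1, even if it is not in df2. The following is a hedged sketch of a key-based alternative, not part of the original answer; the helper row_key and the "\r" separator are hypothetical choices made for illustration.

```r
set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]

# Hypothetical helper: build one string key per row by pasting all columns.
# The "\r" separator is an arbitrary choice assumed not to occur in the data;
# unname() avoids a clash if a column happens to be called "sep".
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

# A row of df1 is flagged only if its key also occurs among the keys of df2
df1$indicator <- as.integer(row_key(df1) %in% row_key(df2))
```

Because the comparison is done on character keys, this assumes the rows of df2 are exact copies of rows in df1, as they are when df2 is drawn by subsetting.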
Arun
  • Very nice +1. @Kerry, what wonders occur when an easily reproducible example is shared! :) – A5C1D2H2I1M1N2O1R2T1 Apr 23 '13 at 09:36
  • Excellent improvement; this way more people can use it, because often there isn't a single unique ID column. – Kerry Apr 23 '13 at 09:58
  • Of course, a very simple alternative is to just create a unique ID prior to drawing the row sample :) Still, there are practical applications, e.g. when the sample was drawn by someone else. Thanks for adding the extra solution, both @AnandaMahto and @Arun! – Maxim.K Apr 23 '13 at 10:31
6

The answers so far are good. However, a question was raised: what if there wasn't a "UniqID" column?

At that point, perhaps merge can be of assistance:

Here's an example using merge and %in% where an ID is available:

set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]

temp <- merge(df1, df2, by = "ID")$ID
df1$matches <- as.integer(df1$ID %in% temp)

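And a similar example where an ID isn't available, merging on row names instead (this works because df2_NoID keeps the row names of the rows it was drawn from):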

set.seed(1)
df1_NoID <- data.frame(A = rnorm(10), B = rnorm(10))
df2_NoID <- df1_NoID[sample(10, 4), ]

temp <- merge(df1_NoID, df2_NoID, by = "row.names")$Row.names
df1_NoID$matches <- as.integer(rownames(df1_NoID) %in% temp)
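As a variation on the merge idea, the flag can also be carried through the merge itself with a left join. This is only a sketch under stated assumptions, not part of the original answer: the helper data frame flag and the result name out are hypothetical, and note that merge() sorts the result by the key column by default, so the row order may differ from df1.

```r
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]

# Hypothetical lookup table: one row per sampled ID, carrying the flag value 1
flag <- data.frame(ID = df2$ID, matches = 1L)

# all.x = TRUE keeps every row of df1; rows without a partner get NA
out <- merge(df1, flag, by = "ID", all.x = TRUE)

# Turn the NAs left by the join into 0
out$matches[is.na(out$matches)] <- 0L
```

This avoids the intermediate temp vector, at the cost of building a new data frame rather than adding a column to df1 in place.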
A5C1D2H2I1M1N2O1R2T1
4

You can use the logical vector directly as a new column; multiplying it by 1 converts TRUE and FALSE to 1 and 0:

long$Indicator <- 1*(long$UniqID %in% short$UniqID)
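The coercions used in this thread (as.numeric(), multiplying by 1, and as.integer()) give the same 0/1 values and differ only in storage type. A small sketch with made-up data; the variable names hit, ind_num, ind_mul and ind_int are assumptions for illustration only.

```r
# Tiny hand-made example so the result is easy to check by eye
long  <- data.frame(UniqID = letters[1:6])
short <- data.frame(UniqID = c("b", "e"))

hit <- long$UniqID %in% short$UniqID  # logical vector, one element per row of long

ind_num <- as.numeric(hit)  # stored as double
ind_mul <- 1 * hit          # also double: arithmetic coerces the logical
ind_int <- as.integer(hit)  # stored as integer
```

Which one to pick is mostly a matter of taste; as.integer() states the intent most explicitly and uses the smaller storage type.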
Nishanth
0

See if this can get you started:

long <- data.frame(UniqID=sample(1:100)) #creating a long data frame
short <- data.frame(UniqID=long[sample(1:100, 30), ]) #creating a short one with the same ids.

long$indicator <- long$UniqID %in% short$UniqID #creating an indicator column in long.
> head(long)
  UniqID indicator
1     87      TRUE
2     15      TRUE
3    100      TRUE
4     40     FALSE
5     89     FALSE
6     21     FALSE
zelite