Matching data from unequal-length data frames in R

Question

This seems like it should be really simple. I have 2 data frames of unequal length in R. One is simply a random subset of the larger data set, so they contain exactly the same data and share an identical UniqueID. What I would like to do is put an indicator, say a 0 or 1, in the larger data set that says whether a row is in the smaller data set.

I can use which(long$UniqID %in% short$UniqID), but I can't figure out how to match this indicator back to the long data set.

Please make your post reproducible so that we can help you; have a look at [**How to make a great reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Thank you. – Arun, Apr 23 '13 at 08:52
@Arun I didn't think it was that complicated a question, so I didn't think I would need to add reproducible data. Since I have no code to do this, I'm not sure what would be necessary to make it reproducible. – Kerry, Apr 23 '13 at 08:55
The question isn't complicated, but I feel it's your responsibility to provide data for others to work with. Imagine everyone who answers many not-so-complicated questions having to create data themselves for each one. It's just easier if the OP provides the data. – Arun, Apr 23 '13 at 09:11
@Arun I was attempting to add sample data to my question when others had already done so. I will always provide sample data from now on, even if I have no code to help. :) – Kerry, Apr 23 '13 at 10:01

Didzis Elferts · Accepted Answer · score 7 · 2013-04-23T09:25:50.483

I made some sample data.

long <- data.frame(UniqID = sample(letters[1:20], 20))
short <- data.frame(UniqID = sample(letters[1:20], 10))

You can use %in% without which() to get a vector of TRUE and FALSE values, and then convert them to 1 and 0 with as.numeric().

long$sh <- as.numeric(long$UniqID %in% short$UniqID)
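To connect this to the which() call from the question: which() returns row positions rather than a TRUE/FALSE vector, so those positions can be used to overwrite entries in a column pre-filled with 0. The following is only a minimal sketch; the seed and the column name sh_idx are assumptions added for illustration and are not part of the original answer.

```r
set.seed(42)  # assumption: seed added only so the sketch is repeatable
long  <- data.frame(UniqID = sample(letters[1:20], 20))
short <- data.frame(UniqID = sample(letters[1:20], 10))

# Route 1: logical vector from %in%, converted to 0/1
long$sh <- as.numeric(long$UniqID %in% short$UniqID)

# Route 2: row positions from which(), used to overwrite a column of zeros
idx <- which(long$UniqID %in% short$UniqID)
long$sh_idx <- 0          # hypothetical column name, recycled to every row
long$sh_idx[idx] <- 1

# Both routes should flag the same rows
identical(long$sh, long$sh_idx)
```

The %in% route is shorter because it skips the intermediate index vector, which is presumably why the answer recommends it.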

edited Apr 23 '13 at 09:25

answered Apr 23 '13 at 08:56

Didzis Elferts


Perfect, and thank you for generating sample data. I apologize for not doing something like this in my question. – Kerry Apr 23 '13 at 09:56

score 7 · Answer 2 · answered Apr 23 '13 at 09:34


I'll use @AnandaMahto's data to illustrate another approach using duplicated(), which works whether or not you have a unique ID column.

Case 1: Has unique id column

set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
# Stack df2 on top of df1, mark repeated IDs, then drop the df2 part of the result
transform(df1, indicator = 1 * duplicated(
  rbind(df2, df1)[, "ID", drop = FALSE])[-seq_len(nrow(df2))])

Case 2: Has no unique id column

set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1))[-seq_len(nrow(df2))])
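One caveat worth checking before relying on Case 2: duplicated() flags any row that repeats an earlier row, so a row that occurs twice within df1 itself is flagged on its second occurrence even when it is absent from df2. The sketch below uses small hand-made data (an assumption for illustration, not the data from this answer) to step through the intermediate values.

```r
# Hand-made data: row 3 repeats row 1, and neither is in the subset
df1 <- data.frame(A = c(1, 2, 1, 4), B = c("x", "y", "x", "z"))
df2 <- df1[2, ]  # the "short" data frame holds only row 2

stacked <- rbind(df2, df1)               # df2 rows first, then df1 rows
dup     <- duplicated(stacked)           # TRUE where a row repeats an earlier row
flag    <- 1 * dup[-seq_len(nrow(df2))]  # drop the df2 part, keep the df1 part

flag  # 0 1 1 0: row 2 is a true match, row 3 is flagged only because it repeats row 1
```

If df1 may contain repeated rows, it seems safer to confirm anyDuplicated(df1) == 0 first, or to fall back on an explicit key.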

Arun


Very nice +1. @Kerry, what wonders occur when an easily reproducible example is shared! :) – A5C1D2H2I1M1N2O1R2T1 Apr 23 '13 at 09:36
Excellent improvement so that way more people can utilize this because often there isn't one unique ID column. – Kerry Apr 23 '13 at 09:58
Of course, a very simple alternative is to just create a unique ID prior to drawing row sample :) Still, there are practical applications e.g. when the sample was drawn by someone else etc. Thanks for adding the extra solution, both @AnandaMahto and @Arun! – Maxim.K Apr 23 '13 at 10:31

A5C1D2H2I1M1N2O1R2T1 · Answer 3 · 2013-04-23T09:24:30.620

The answers so far are good. However, a question was raised: what if there weren't a "UniqID" column?

At that point, perhaps merge can be of assistance:

Here's an example using merge and %in% where an ID is available:

set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]

temp <- merge(df1, df2, by = "ID")$ID
df1$matches <- as.integer(df1$ID %in% temp)

And a similar example where an ID isn't available. This relies on df2_NoID keeping the row names it inherited from df1_NoID when it was subset:

set.seed(1)
df1_NoID <- data.frame(A = rnorm(10), B = rnorm(10))
df2_NoID <- df1_NoID[sample(10, 4), ]

temp <- merge(df1_NoID, df2_NoID, by = "row.names")$Row.names
df1_NoID$matches <- as.integer(rownames(df1_NoID) %in% temp)
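Another way to use merge here, offered only as a sketch and not as this answer's method, is to tag the rows of the short data frame and left-join the tag onto the long one with all.x = TRUE. The helper data frame tagged and the result name out are hypothetical names introduced for illustration.

```r
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]

# Tag the rows of the short data frame, keeping only the key and the tag
tagged <- data.frame(ID = df2$ID, matches = 1L)

# Left join: all.x = TRUE keeps every row of df1; unmatched rows get NA
out <- merge(df1, tagged, by = "ID", all.x = TRUE)
out$matches[is.na(out$matches)] <- 0L

out
```

Note that merge() sorts the result by the key by default, so the row order of out can differ from df1 when df1 is not already ordered by ID.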

score 4 · Answer 4 · answered Apr 23 '13 at 08:56


You can use the logical vector directly as a new column; multiplying it by 1 converts TRUE/FALSE to 1/0:

long$Indicator <- 1*(long$UniqID %in% short$UniqID)
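The conversions used across these answers (1 *, as.numeric() and as.integer()) differ only in the storage type of the result. A small sketch with made-up values, assuming nothing beyond base R:

```r
x <- c("a", "b", "c") %in% c("b")   # logical result of %in%

class(x)              # "logical"
class(1 * x)          # "numeric"
class(as.integer(x))  # "integer"

# Logical values already behave as 0/1 in arithmetic, so counting matches
# does not need a conversion at all
sum(x)
```

Which form to store is mostly a matter of taste; the integer version is the most compact if the column is only ever a 0/1 flag.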

Nishanth


score 0 · Answer 5 · answered Apr 23 '13 at 09:03

See if this can get you started:

long <- data.frame(UniqID = sample(1:100))                    # create a long data frame
short <- data.frame(UniqID = long$UniqID[sample(1:100, 30)])  # create a short one with IDs drawn from long

long$indicator <- long$UniqID %in% short$UniqID               # create an indicator column in long
> head(long)
  UniqID indicator
1     87      TRUE
2     15      TRUE
3    100      TRUE
4     40     FALSE
5     89     FALSE
6     21     FALSE
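A quick sanity check on an indicator like this is to compare the number of flagged rows with the size of the short data frame. The sketch below rebuilds the data with a fixed seed (an assumption added for repeatability, so the values will not match the output shown above).

```r
set.seed(123)  # assumption: seed added only so the sketch is repeatable
long  <- data.frame(UniqID = sample(1:100))
short <- data.frame(UniqID = long$UniqID[sample(1:100, 30)])
long$indicator <- long$UniqID %in% short$UniqID

# The number of flagged rows should equal the size of the short data frame
sum(long$indicator) == nrow(short)

# Every ID in short should be found in long
all(short$UniqID %in% long$UniqID)

# Tabulate flagged vs. unflagged rows
table(long$indicator)
```

If the first check fails on real data, the likely causes are repeated IDs in one of the data frames or IDs in short that do not occur in long.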
