Find the row numbers in A that match rows in B

Question

I want to know the index of the rows in A that match rows in B.

Both A and B are data frames. For simplicity just assume:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

In this case it's supposed to return 1,2,3.

My full dataset has 500k rows and 18 columns.

Do you have a matrix? A data frame? Something else? Please share valid R code to produce your example data, e.g., `A = matrix(c(1, 3, 2, 4), nrow = 2)` or perhaps `A = data.frame(v1 = c(1, 3), v2 = c(2, 4))`. And similarly for `B`. Please edit your question to update it with this information, don't just put it in a comment. If you have other questions about reproducibility, [see here for lots of great ways to make good, reproducible questions](http://stackoverflow.com/q/5963269/903061). — Gregor Thomas, Mar 18 '16 at 06:14
Maybe `match(paste(B[,1],B[,2],sep="*"),paste(A[,1],A[,2],sep="*"))`. Please provide a sample of your data if you want more-than-a-guess answers. — nicola, Mar 18 '16 at 06:26

score 2 · Answer 1 · answered Mar 18 '16 at 06:30

2

The join.keys function in the plyr package provides a key to each unique row across a pair of input data frames, which makes it pretty straightforward to determine which rows from A appear in B. In the list returned by join.keys, x is the vector of row identifiers for the first data frame and y is the vector of row identifiers for the second data frame.

library(plyr)
with(join.keys(a1, a2), which(x %in% y))
# [1] 1 2 3

answered Mar 18 '16 at 06:30

josliber

43,891
12
98
133

what if my matrix has more than 2 columns? – j20120307 Mar 18 '16 at 06:36
@j20120307 This will work without modification no matter the number of columns. – josliber Mar 18 '16 at 06:37
what does row identifier mean here? I have c("V1",...,"V18") as my header for both data frames. Each row is a data point and there are 18 features for each data point. – j20120307 Mar 18 '16 at 06:53
@j20120307 If two rows have identical data for all 18 features they will have the same row identifier; otherwise they will have a different row identifier. – josliber Mar 18 '16 at 06:57
Thank you this works. What if I want to know the corresponding row number in A that rows in B matches in A? If A=[[1,2],[2,3],[3,6]] and B=[[2,4],[3,6]], then it returns [NA, 3]? – j20120307 Mar 18 '16 at 07:11
@j20120307 `with(join.keys(a1, a2), as.numeric(na.omit(match(x, y))))` returns the first matching row for each row of `a1` that actually matched. If you want the first matching row with `NA` values when there is no match, try `with(join.keys(a1, a2), match(x, y))`. – josliber Mar 18 '16 at 17:12
Thank you for your help. I managed to handle this by add a new column `c(1:nrow(a1))` at the end of a1, then I reorder the group by setkey on all the other columns and then reorder it back by my newly added column at the end. – j20120307 Mar 19 '16 at 17:20

score 0 · Answer 2 · edited May 02 '16 at 09:47

0

You can use this code :

>subset(a1,a1$a %in% a2$a)

It returns :

>1  a

>2  b

>3  c

If you just want column a, you can add :

>subset(a1,a1$a %in% a2$a,a)

>1

>2

>3

I think it will be fast to do it on your data.

edited May 02 '16 at 09:47

David Arenburg

91,361
17
137
196

answered May 02 '16 at 09:43

Estelle Duval

313
4
12

Find the row numbers in A that match rows in B

2 Answers2