2

I want to know the index of the rows in A that match rows in B.

Both A and B are data frames. For simplicity just assume:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

In this case it's supposed to return 1,2,3.

My full dataset has 500k rows and 18 columns.

zx8754
  • 52,746
  • 12
  • 114
  • 209
j20120307
  • 71
  • 2
  • 8
  • 7
    Do you have a matrix? A data frame? Something else? Please share valid R code to produce your example data, e.g., `A = matrix(c(1, 3, 2, 4), nrow = 2)` or perhaps `A = data.frame(v1 = c(1, 3), v2 = c(2, 4))`. And similarly for `B`. Please edit your question to update it with this information, don't just put it in a comment. If you have other questions about reproducibility, [see here for lots of great ways to make good, reproducible questions](http://stackoverflow.com/q/5963269/903061). – Gregor Thomas Mar 18 '16 at 06:14
  • Maybe `match(paste(B[,1],B[,2],sep="*"),paste(A[,1],A[,2],sep="*"))`. Please provide a sample of your data if you want more-than-a-guess answers. – nicola Mar 18 '16 at 06:26

2 Answers2

2

The join.keys function in the plyr package provides a key to each unique row across a pair of input data frames, which makes it pretty straightforward to determine which rows from A appear in B. In the list returned by join.keys, x is the vector of row identifiers for the first data frame and y is the vector of row identifiers for the second data frame.

library(plyr)
with(join.keys(a1, a2), which(x %in% y))
# [1] 1 2 3
josliber
  • 43,891
  • 12
  • 98
  • 133
  • what if my matrix has more than 2 columns? – j20120307 Mar 18 '16 at 06:36
  • @j20120307 This will work without modification no matter the number of columns. – josliber Mar 18 '16 at 06:37
  • what does row identifier mean here? I have c("V1",...,"V18") as my header for both data frames. Each row is a data point and there are 18 features for each data point. – j20120307 Mar 18 '16 at 06:53
  • @j20120307 If two rows have identical data for all 18 features they will have the same row identifier; otherwise they will have a different row identifier. – josliber Mar 18 '16 at 06:57
  • Thank you this works. What if I want to know the corresponding row number in A that rows in B matches in A? If A=[[1,2],[2,3],[3,6]] and B=[[2,4],[3,6]], then it returns [NA, 3]? – j20120307 Mar 18 '16 at 07:11
  • @j20120307 `with(join.keys(a1, a2), as.numeric(na.omit(match(x, y))))` returns the first matching row for each row of `a1` that actually matched. If you want the first matching row with `NA` values when there is no match, try `with(join.keys(a1, a2), match(x, y))`. – josliber Mar 18 '16 at 17:12
  • Thank you for your help. I managed to handle this by add a new column `c(1:nrow(a1))` at the end of a1, then I reorder the group by setkey on all the other columns and then reorder it back by my newly added column at the end. – j20120307 Mar 19 '16 at 17:20
0

You can use this code :

>subset(a1,a1$a %in% a2$a)

It returns :

>1  a

>2  b

>3  c

If you just want column a, you can add :

>subset(a1,a1$a %in% a2$a,a)

>1

>2

>3

I think it will be fast to do it on your data.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Estelle Duval
  • 313
  • 4
  • 12