I wish to select rows in one data frame, data.1
, that partially match rows in a second data frame, keep.these
, to obtain the desired.result
. I have found several questions here that match based on one column, but I want to match on three columns: STATE
, COUNTY
and CITY
. I have come up with three solutions so far, but none seem ideal.
Note that each row contains a unique combination of STATE
, COUNTY
and CITY
in my real data.
When I use merge
I must re-order
. The function match
seems to work, but I am not familiar with it and do not know if my use of this function is as intended. The apply
solution below is clearly too complex.
The merge
approach would be ideal if I did not have to reorder the result. Reordering can be time consuming with large data sets. The match
approach seems okay if someone can confirm this is a reasonable approach.
Is there a better solution, ideally in base R
?
data.1 <- read.table(text= "
CITY COUNTY STATE AA
1 1 1 2
2 1 1 4
1 2 1 6
2 2 1 8
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 3 200
2 1 3 400
1 2 3 600
2 2 3 800
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
1 1 5 20000
2 1 5 40000
1 2 5 60000
2 2 5 80000
", header=TRUE, na.strings=NA)
keep.these <- read.table(text= "
CITY COUNTY STATE BB
1 1 2 -10
2 1 2 -11
1 2 2 -12
2 2 2 -13
1 1 4 -14
2 1 4 -15
1 2 4 -16
2 2 4 -17
", header=TRUE, na.strings=NA)
desired.result <- read.table(text= "
CITY COUNTY STATE AA
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
", header=TRUE, na.strings=NA)
##########
# this works, but I need to reorder
new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'))
new.data.a <- new.data.a[order(new.data.a$STATE, new.data.a$COUNTY, new.data.a$CITY),]
rownames(desired.result) <- NULL
rownames(new.data.a) <- NULL
all.equal(desired.result, new.data.a)
##########
# this seems to work, but match is unfamiliar
new.data.2 <- data.1[match(data.1$CITY , keep.these$CITY , nomatch=0) &
match(data.1$STATE , keep.these$STATE , nomatch=0) &
match(data.1$COUNTY, keep.these$COUNTY, nomatch=0),]
rownames(desired.result) <- NULL
rownames(new.data.2) <- NULL
all.equal(desired.result, new.data.2)
##########
# this works, but is too complex
data.1b <- data.frame(my.group = apply( data.1[,1:3], 1, paste, collapse = "."), data.1)
keep.these.b <- data.frame(my.group = apply(keep.these[,1:3], 1, paste, collapse = "."), keep.these)
data.1b <- data.1b[apply(data.1b, 1, function(x) {x[1] %in% keep.these.b$my.group}),]
data.1b <- data.1b[,-1]
rownames(desired.result) <- NULL
rownames(data.1b) <- NULL
all.equal(desired.result, data.1b)
##########