I'm collecting survey data (using Open Data Kit), and my field team, bless their hearts, sometimes gets a bit creative with the spelling of people's names. So I have a "correct" respondent name, plus, for some of the records, an age variable that is linked to a "family member name" variable. There are many family members with different ages, and the respondent is one of them. I want the respondent's age.
Here is some fake data that illustrates my problem:
#the respondent
r = data.frame(name = c("Barack Obama", "George Bush", "Hillary Clinton"))
#a male member
m = data.frame(name = c("Barack Obama","George", "Wulliam Clenton"), age = c(55,59,70)); m$name=as.character(m$name)
#a female member
f = data.frame(name = c("Michelle O","Laura Busch", "Hillary Rodham Clinton"), age = c(54,58,69)); f$name=as.character(f$name)
#if the respondent is the given member, record their age. if not, NA
a = cbind(
ifelse(r$name==m$name,m$age,NA)
,ifelse(r$name==f$name,f$age,NA)
)
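#make a function for plyr that gives me the age of the matched respondent
#(named first_age so it doesn't overwrite the data frame f above)
first_age = function(row){
  d = row[!is.na(row)]  #keep only the ages where the names matched
  if (length(d) == 0) NA_real_ else d[1]  #no match at all gives NA
}
library(plyr)
b = aaply(a,.margins=1,.fun=first_age)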
data.frame(names=r$name,age=b)
            names age
1    Barack Obama  55
2     George Bush  NA
3 Hillary Clinton  NA
what.I.would.like = data.frame(names=c("Barack Obama", "George Bush", "Hillary Clinton"),age = c(55,59,70))
1> what.I.would.like
            names age
1    Barack Obama  55
2     George Bush  59
3 Hillary Clinton  70
In my real data, I've got hundreds of people and up to 13 family members per record. I've since changed the survey to record respondent age separately, but I've still got a mess of data to clean.
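One possible direction is approximate string matching instead of exact equality. The following is only a rough sketch, assuming base R's adist() (generalized Levenshtein edit distance) is an acceptable similarity measure for these names. The names fem, members and max_dist are made up for illustration, and the cutoff value is an arbitrary guess that would need tuning on real data.

```r
# Fake data from above (the female table is called fem here, a made-up name)
r   <- data.frame(name = c("Barack Obama", "George Bush", "Hillary Clinton"),
                  stringsAsFactors = FALSE)
m   <- data.frame(name = c("Barack Obama", "George", "Wulliam Clenton"),
                  age = c(55, 59, 70), stringsAsFactors = FALSE)
fem <- data.frame(name = c("Michelle O", "Laura Busch", "Hillary Rodham Clinton"),
                  age = c(54, 58, 69), stringsAsFactors = FALSE)

# One entry per family-member slot; the real data would have up to 13
members <- list(m, fem)

# Edit distance between each respondent and the member in the same row:
# rows are respondents, columns are member slots
d <- sapply(members, function(mem) {
  mapply(function(a, b) adist(a, b, ignore.case = TRUE)[1, 1],
         r$name, mem$name)
})

# Ages laid out the same way as the distances
ages <- sapply(members, function(mem) mem$age)

# For each respondent, take the member slot with the smallest distance
best <- apply(d, 1, which.min)
idx  <- cbind(seq_len(nrow(d)), best)

# Assumed cutoff: treat anything further away than this as "no match"
max_dist <- 6
result <- data.frame(names = r$name,
                     age = ifelse(d[idx] <= max_dist, ages[idx], NA))
result
```

This picks the closest member within each row rather than requiring an exact match, so the margin between the best and second-best candidate is worth inspecting before trusting a result. Rows where a member name is missing are not handled here and would need extra care.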