There's no correlation between the wpsum values of those with the same family-id, as you put it, mainly because there's no third variable with which to correlate wpsum within the family-id groups (see my comment). You can, however, get the difference in wpsum scores within each group, which may be what you meant by correlation. Here's how to get those (I changed and expanded your example):
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
                  family_id = c("220", "220", "221", "221", "222", "222", "223"))
dat
  wpsum family_id
1    14       220
2    18       220
3    20       221
4     5       221
5    10       222
6    NA       222
7     1       223
diffs <- by(dat, dat$family_id, function(x) abs(x$wpsum[1] - x$wpsum[2]))
diffs
dat$family_id: 220
[1] 4
------------------------------
dat$family_id: 221
[1] 15
------------------------------
dat$family_id: 222
[1] NA
------------------------------
dat$family_id: 223
[1] NA
You can make a data.frame with this new variable of differences like so:
diff.frame <- data.frame(diffs = as.numeric(diffs), family_id = names(diffs))
diff.frame
  diffs family_id
1     4       220
2    15       221
3    NA       222
4    NA       223
Note that neither missing values (an NA score, as in family 222) nor missing observations (a family with only one row, as in 223) are a (coding) problem here: they just result in missing differences, without error. If you started having more than two observations within a family ID, though, this code would only compare the first two, so you'd need to do something different.
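For the more-than-two case, one possible approach is to report the spread (largest minus smallest observed score) within each family. This is only a sketch under my own assumptions: that the spread is the "difference" you want, and that missing scores should be dropped before comparing. The data frame dat3 (your example plus a made-up third member in family 220) and the helper spread are hypothetical names, not part of your data.

```r
# Example data from above, plus a hypothetical third member (wpsum = 25) in family 220
dat3 <- data.frame(wpsum = c(14, 18, 25, 20, 5, 10, NA, 1),
                   family_id = c("220", "220", "220", "221", "221", "222", "222", "223"))

# Hypothetical helper: max - min of the non-missing scores in one family.
# Returns NA when fewer than two non-missing scores are available.
spread <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) NA_real_ else max(x) - min(x)
}

# Apply the helper to wpsum within each family_id group
spreads <- tapply(dat3$wpsum, dat3$family_id, spread)

# Same layout as diff.frame above
spread.frame <- data.frame(spread = as.numeric(spreads), family_id = names(spreads))
spread.frame
```

With exactly two non-missing scores per family this gives the same absolute difference as before. The behaviour differs when a family has three or more rows and some are NA: the spread is then computed from the remaining scores instead of returning NA. If you'd rather have every pairwise difference within a family, you'd need a different summary than a single number per group.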