How can I calculate the inter-pair correlation of a variable according to id in the whole dataframe?

Question

I have a twin dataset with one column called wpsum and another called family-id, which is the same for both twins in a pair.

        wpsum    family-id
twin 1     14          220    
twin 2     18          220

I want to calculate the correlation between the wpsum values of those with the same family-id. Some family IDs appear only once, because one twin did not take part in the re-survey. family-id is a character column.

[Could you add some data](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) similar to yours(also fake data) that we can copy and paste in r without further mods? — s__, Aug 17 '18 at 11:41
Since there's never more than two observations within each family ID group, they can't be meaningfully correlated - you have too few degrees of freedom even for a linear regression. In this case, the result will be perfect prediction, i.e. a regression line drawn directly through both datapoints. Or are you referring to a different kind of analysis? — DHW, Aug 17 '18 at 11:51
Note that I'm assuming you have something else to correlate `wpsum` with, within the family groups, in the first place. — DHW, Aug 17 '18 at 12:06
@DHW The `wpsum` is the sum of the Wilson-Patterson Index, measuring political ideology. It ranges from -20 (very liberal) to +20 (very conservative). I want to calculate the correlation of that index between each twin pair. Then I compare the average inter-twin-pair correlation of monozygotic twins to the average inter-twin-pair correlation of dizygotic twins to see if heritable factors play a role in ideology. Does that make sense? — Jana, Aug 17 '18 at 13:05
@Jana I'm a political scientist myself. Whether an individual is part of a DZ or MZ pair is your necessary third variable. I'm guessing your theory is that MZ twins have more in common, including their ideology. So then you need to start with the differences as per my answer below, but add a variable for whether the family group consists of DZ or MZ pairs. Then use pair type to predict the differences, i.e. compare the two groups' means of ideological distance. Your unit of analysis needs to be twin pairs. I'd rewrite the question accordingly, though I'll leave my answer as-is for now. — DHW, Aug 17 '18 at 13:23
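
The comparison discussed in these comments can be sketched at the pair level. This is only an illustration on fake data, not a method confirmed anywhere in the thread: the `zygosity` column, the extra families, and the names `pairs`, `mean_diff`, and `pair_cor` are all assumptions. It assumes at most two rows per family, computes the mean absolute difference per pair type, and also the correlation across pairs (twin 1 vs. twin 2) per pair type. Because the order of twins within a pair is arbitrary, twin studies often use an intraclass correlation instead of the plain Pearson `cor()` used here.

```r
# Fake data; `zygosity` is a hypothetical column, not part of the original example.
dat <- data.frame(
  wpsum     = c(14, 18, 20, 5, 10, NA, 1, -3, -5, 8, 2,
                6, 9, -4, 7, -12, -8, 0, -10),
  family_id = c("220", "220", "221", "221", "222", "222", "223",
                "224", "224", "225", "225", "226", "226",
                "227", "227", "228", "228", "229", "229"),
  zygosity  = c("MZ", "MZ", "DZ", "DZ", "MZ", "MZ", "DZ",
                "MZ", "MZ", "DZ", "DZ", "MZ", "MZ",
                "DZ", "DZ", "MZ", "MZ", "DZ", "DZ"),
  stringsAsFactors = FALSE
)

# Keep only families where both twins have a score.
complete     <- dat[!is.na(dat$wpsum), ]
n_per_family <- table(complete$family_id)
keep         <- names(n_per_family)[as.vector(n_per_family) == 2]
complete     <- complete[complete$family_id %in% keep, ]

# One row per pair: first twin and second twin side by side.
complete <- complete[order(complete$family_id), ]
first    <- complete[!duplicated(complete$family_id), ]
second   <- complete[duplicated(complete$family_id), ]
pairs <- data.frame(
  family_id = first$family_id,
  zygosity  = first$zygosity,
  twin1     = first$wpsum,
  twin2     = second$wpsum,
  stringsAsFactors = FALSE
)

# Mean absolute within-pair difference, by pair type.
pairs$abs_diff <- abs(pairs$twin1 - pairs$twin2)
mean_diff <- tapply(pairs$abs_diff, pairs$zygosity, mean)

# Correlation across pairs (twin1 vs. twin2), by pair type.
pair_cor <- sapply(split(pairs, pairs$zygosity),
                   function(p) cor(p$twin1, p$twin2))

mean_diff
pair_cor
```

Here the unit of analysis is the pair, so each correlation is computed over many pairs rather than inside a single two-person family.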

DHW · Accepted Answer · 2018-08-17T13:41:59.703


As phrased, there is no correlation to compute between the wpsum values of those with the same family-id, mainly because there's no third variable with which to correlate wpsum within the family-id groups (see my comment). You can, however, get the difference in wpsum scores within each group, which may be what you meant by correlation. Here's how to get those (I changed and expanded your example):

dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1), 
              family_id = c("220","220","221","221","222","222","223"))
dat
  wpsum family_id
1    14       220
2    18       220
3    20       221
4     5       221
5    10       222
6    NA       222
7     1       223

diffs <- by(dat, dat$family_id, function(x) abs(x$wpsum[1] - x$wpsum[2]))
diffs
dat$family_id: 220
[1] 4
------------------------------ 
dat$family_id: 221
[1] 15
------------------------------
dat$family_id: 222
[1] NA
------------------------------
dat$family_id: 223
[1] NA

You can make a data.frame with this new variable of differences like so:

diff.frame <- data.frame(diffs = as.numeric(diffs), family_id = names(diffs))
diff.frame
  diffs family_id
1     4       220
2    15       221
3    NA       222
4    NA       223

Note that neither missing values nor missing observations are a (coding) problem here - they just result in missing differences without error. If you started having more than two observations within each family ID, though, then you'd need to do something different.
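
For the case of more than two observations per family ID, one possible guard is to return NA unless a group has exactly two rows. This is a minimal sketch: the helper name `pair_diff` and the fake family 224 with three rows are assumptions added for illustration.

```r
dat <- data.frame(
  wpsum     = c(14, 18, 20, 5, 10, NA, 1, 3, 7, 9),
  family_id = c("220", "220", "221", "221", "222", "222", "223",
                "224", "224", "224"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: absolute difference for a complete pair,
# NA for singletons and for groups with more than two rows.
pair_diff <- function(x) {
  if (length(x) != 2) return(NA_real_)
  abs(x[1] - x[2])
}

diffs <- tapply(dat$wpsum, dat$family_id, pair_diff)
diffs
```

Returning NA keeps oversized groups visible in the result instead of silently using only their first two rows.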

edited Aug 17 '18 at 13:41

answered Aug 17 '18 at 12:36


Thank you, I did that, but it gives me 25 everywhere: `identicalwp$FAMID.y: 55713` `[1] 25`, `identicalwp$FAMID.y: 55714` `[1] 25` – Jana Aug 17 '18 at 13:39
Forgot to add the additional `wpsum` value in my answer. But that doesn't sound like it's your issue. Probably something with how your real data are coded as compared to this example. Think about changing the `abs(x$wpsum[1] - x$wpsum[2])` function. It needs to reference the two different individuals' scores within each twin pair, where x is the dataset restricted to just two (at most) observations at a time. – DHW Aug 17 '18 at 13:43
could it be because some `wpsum` is negative and some positive? In that case I could just change the numerical levels of the dataframe? – Jana Aug 17 '18 at 13:50
Nope, absolute value of the difference would account for that. Getting the same value every time has to be because the `by` function isn't correctly referencing the two different individuals' values within each separate pair. Maybe think about what it is referencing, to produce that particular value every time. – DHW Aug 17 '18 at 13:53
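
One hypothetical way to get the same value for every group, not necessarily what happened with the real data, is for the function passed to `by` to index the full data frame instead of its argument `x`. The sketch below reuses the example data from the answer; the names `buggy` and `fixed` are made up for illustration.

```r
dat <- data.frame(
  wpsum     = c(14, 18, 20, 5, 10, NA, 1),
  family_id = c("220", "220", "221", "221", "222", "222", "223"),
  stringsAsFactors = FALSE
)

# Buggy: the function ignores `x` and indexes the full data frame,
# so every group reports the difference between rows 1 and 2 of `dat`.
buggy <- by(dat, dat$family_id, function(x) abs(dat$wpsum[1] - dat$wpsum[2]))

# Intended: `x` holds only the rows belonging to one family.
fixed <- by(dat, dat$family_id, function(x) abs(x$wpsum[1] - x$wpsum[2]))

as.numeric(buggy)  # the same value for every family
as.numeric(fixed)  # varies by family; NA for incomplete pairs
```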
