I'm stuck on the following problem in R and was hoping someone had a quick solution.

I have two data sets, A and B: A contains data for a control group and B for a case group. The same variables are measured in each group. Both sets contain subgroups, and in some instances these are paired between A and B; for example, they may be siblings, where one or more can be a case and one or more a control.
The data look something like this:

SET A:

Source  Area    group   pch pch2    col col2    group2  
R1-1    1983447     1   0     16    1      1    1   
R1-3    1400362     1   0     16    1      1    1
R3-4    2834393     2   1     16    2      2    1
R4-2    2232820     3   2     16    3      3    1   
R4-5    1713796     3   2     16    3      3    1   
R4-6    1525740     3   2     16    3      3    1   
R4-7    1182300     3   2     16    3      3    1   

SET B:

Source  Area    group   pch pch2    col col2    group2
R1-2    1246124     1   0     16    1      1    2
R3-1    1627610     2   1     16    2      2    2
R3-2    1401600     2   1     16    2      2    2
R4-1    1367146     3   2     16    3      3    2
R4-3    1764125     3   2     16    3      3    2
R4-4    1299864     3   2     16    3      3    2

Source is ID, Area is the variable of interest, group is group, and the rest are additional variables that are not of interest here.
What I'd like to do is calculate the relative Area for each individual in set B, i.e., relative to the mean Area of their siblings in set A. I'd like this value to appear as a separate column in set B (relArea in the sample below). The output would therefore look like this:

Output (Set B):

Source  Area    group   relArea pch pch2    col col2    group2
R1-2    1246124   1 0.736521476   0 16        1    1    2
R3-1    1627610   2 0.574235824   1 16        2    2    2
R3-2    1401600   2 0.494497411   1 16        2    2    2
R4-1    1367146   3 0.821768097   2 16        3    3    2
R4-3    1764125   3 1.06038539    2 16        3    3    2
R4-4    1299864   3 0.781326037   2 16        3    3    2

Finally, if an individual in set B does not have a sibling in set A, then his relArea value would be the Area relative to average Area of all the controls (i.e., all measurements in set A).

Any help with this would be much appreciated.

thanks,
Bjorn

1 Answer

You could compute the average area per group in Set A with aggregate and then add your new column:

seta = read.table(text="Source  Area    group   pch pch2    col col2    group2  
  R1-1    1983447     1   0     16    1      1    1   
  R1-3    1400362     1   0     16    1      1    1
  R3-4    2834393     2   1     16    2      2    1
  R4-2    2232820     3   2     16    3      3    1   
  R4-5    1713796     3   2     16    3      3    1   
  R4-6    1525740     3   2     16    3      3    1   
  R4-7    1182300     3   2     16    3      3    1  ", header=TRUE)
setb = read.table(text="Source  Area    group   pch pch2    col col2    group2
  R1-2    1246124     1   0     16    1      1    2
  R3-1    1627610     2   1     16    2      2    2
  R3-2    1401600     2   1     16    2      2    2
  R4-1    1367146     3   2     16    3      3    2
  R4-3    1764125     3   2     16    3      3    2
  R4-4    1299864     3   2     16    3      3    2", header=TRUE)
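# Mean control Area per sibling group in set A (result columns: group, x)
grouped.area = aggregate(seta$Area, by=list(group=seta$group), mean)
# Divide each set B Area by the mean of its matching group in set A;
# match() yields NA where a group has no siblings in set A
setb$relArea = setb$Area / grouped.area$x[match(setb$group, grouped.area$group)]
setb$relArea
# [1] 0.7365215 0.5742358 0.4944974 0.8217681 1.0603854 0.7813260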
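
The question also asks for a fallback when an individual in set B has no siblings in set A. Here is a minimal sketch of one way to handle that, using small made-up data frames rather than the data above (the values and the unmatched group 4 are hypothetical). It keeps the same aggregate/match approach and fills in the missing baselines with the overall control mean:

```r
# Hypothetical data: group 4 in setb has no siblings in seta
seta <- data.frame(group = c(1, 1, 2), Area = c(100, 300, 400))
setb <- data.frame(group = c(1, 2, 4), Area = c(100, 100, 150))

# Mean control Area per sibling group (result columns: group, x)
grouped.area <- aggregate(seta$Area, by = list(group = seta$group), mean)

# Per-row baseline; match() returns NA for groups absent from seta
baseline <- grouped.area$x[match(setb$group, grouped.area$group)]

# Fall back to the mean Area of all controls where no sibling baseline exists
baseline[is.na(baseline)] <- mean(seta$Area)

setb$relArea <- setb$Area / baseline
setb$relArea
# values: 0.5, 0.25, 0.5625
```

Filling in the baseline before dividing avoids a second pass over relArea, but replacing the NA values afterwards with ifelse() should give the same result.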
josliber
  • Excellent, thanks for the help. This works like a charm. I had to add one command line to compute values for the individuals that had no siblings in setA: is.na(setB$relArea) <- setB$Area / mean(setA$Area) – bjornlovesR Feb 11 '14 at 08:51
  • Actually, that's not correct. I've added this command: setb$relArea <- ifelse(is.na(setb$relArea), setb$Area / mean(seta$Area), setb$relArea). Perhaps not elegant, but it works. – bjornlovesR Feb 11 '14 at 15:19