In R, I have a data frame that looks like this:

         Female.ID    Mate.ID  relatedness
    1           A1         C1       0.0000
    2           A1         D1       0.0000 
    3           A1         E1       0.5062
    4           A1         F1           NA
    5           B1         G1       0.0425
    6           B1         H1       0.0000
    7           B1         I1       0.0349
    8           B1         J1       0.0000
    9           B1         K1       0.0000
    10          B1         L1       0.0887
    11          B1         M1       0.1106
    12          B1         N1       0.0000

I want to create a new data frame containing the mean relatedness of all the mates for each Female.ID: one mean for A1, one for B1, and so on.

I want something like this:

    Female.ID    mean.relatedness
           A1              0.1687
           B1              0.0346

This dataframe is much bigger than this example one, which is why I'm not just subsetting for the female one by one and finding the mean relatedness. I was thinking of doing some kind of for loop, but I'm not sure how to start it off.

Jennifer Diamond
  • You can probably just group, and use `dplyr` to calculate the mean. How do you want to handle your NA values? Should they be implicitly zero, or excluded altogether? – Mako212 Nov 17 '17 at 17:09
  • I think excluding might be best. Thanks for your help! Looks like crazybilly excluded NA in his code, so I will try that out. – Jennifer Diamond Nov 17 '17 at 17:14
  • @akrun you're right, sorry, the mean relatedness value in the first row of my desired dataframe was incorrect. I just edited it. Thanks! – Jennifer Diamond Nov 17 '17 at 17:21
  • For more references, you can check [here](https://stackoverflow.com/questions/11562656/calculate-mean-per-group-mean-by-group) – akrun Nov 17 '17 at 17:22

2 Answers

You could use dplyr:

    library(dplyr)

    # Group the rows by female, then take the mean per group, skipping NA values
    themeans <- df %>%
        group_by(Female.ID) %>%
        summarize(mean.relatedness = mean(relatedness, na.rm = TRUE))

crazybilly

The idea is:

  • to do a group by "Female.ID"
  • then summarize using the mean while ignoring the NA.
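
The two steps above can be sketched in base R with `aggregate()`, using the sample data from the question. This is only an illustration of the group-then-summarize idea and not the approach used in the rest of this answer; the object name `means` is an assumption.

```r
# Sample data copied from the question; mate F1 has a missing relatedness value.
df <- data.frame(
  Female.ID   = c(rep("A1", 4), rep("B1", 8)),
  Mate.ID     = c("C1", "D1", "E1", "F1", "G1", "H1",
                  "I1", "J1", "K1", "L1", "M1", "N1"),
  relatedness = c(0, 0, 0.5062, NA, 0.0425, 0,
                  0.0349, 0, 0, 0.0887, 0.1106, 0),
  stringsAsFactors = FALSE
)

# Step 1: group by Female.ID (right-hand side of the formula).
# Step 2: summarize each group with mean().
# na.action = na.pass keeps the NA row, so na.rm = TRUE is what drops it.
means <- aggregate(relatedness ~ Female.ID, data = df, FUN = mean,
                   na.rm = TRUE, na.action = na.pass)
names(means)[2] <- "mean.relatedness"
print(means)
```

The result has one row per female, which matches the shape of the data frame requested in the question.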

If the data is very large, you may need a faster package such as data.table, which is fast and has a concise syntax. For more details, see the question "data.table vs dplyr: can one do something well the other can't or does poorly?".

In general, explicit loops are slower in R than vectorized or grouped operations. Keep a loop as a last resort, only when the operation can't be expressed with the package's functions.
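
For comparison, here is a rough sketch of the for loop the question had in mind, next to a vectorized base R equivalent using `tapply()`. The reduced data set and the names `loop_means` and `vec_means` are assumptions made for the example; no timings are claimed here.

```r
# A small subset of the question's data (first six rows).
df <- data.frame(
  Female.ID   = c("A1", "A1", "A1", "A1", "B1", "B1"),
  relatedness = c(0, 0, 0.5062, NA, 0.0425, 0),
  stringsAsFactors = FALSE
)

# Loop version: preallocate the result, then fill one entry per female.
ids <- unique(df$Female.ID)
loop_means <- numeric(length(ids))
names(loop_means) <- ids
for (id in ids) {
  loop_means[id] <- mean(df$relatedness[df$Female.ID == id], na.rm = TRUE)
}

# Vectorized version: tapply() splits relatedness by Female.ID and applies mean().
vec_means <- tapply(df$relatedness, df$Female.ID, mean, na.rm = TRUE)

# Both approaches give the same per-female means.
print(loop_means)
print(vec_means)
```

The loop works, but the grouped call states the intent in one line and leaves the iteration to R.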

Here is the syntax using data.table (`df` being the initial data.frame):

    library(data.table)

    # Convert the data.frame, then compute the mean per Female.ID, skipping NA values
    dt <- as.data.table(df)
    dt1 <- dt[, .(mean.relatedness = mean(relatedness, na.rm = TRUE)),
              by = "Female.ID"]

    > dt1
       Female.ID mean.relatedness
    1:        A1        0.1687333
    2:        B1        0.0345875

Note that you can group by a vector of several variables, the summarizing function can be something other than the mean, and `na.rm = TRUE` is needed to ignore NA values while summarizing.
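
As a minimal sketch of those points, the example below groups by two variables and computes several summaries at once. The `Year` column is hypothetical (it is not in the question's data) and is added only to have a second grouping variable; the name `res` is also an assumption.

```r
library(data.table)

dt <- data.table(
  Female.ID   = c("A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"),
  Year        = c(2016, 2016, 2017, 2017, 2016, 2016, 2017, 2017),  # hypothetical column
  relatedness = c(0, 0, 0.5062, NA, 0.0425, 0, 0.0349, 0.0887)
)

# Group by a vector of column names and compute several summaries per group.
# .N counts all rows in the group, including rows where relatedness is NA.
res <- dt[, .(mean.relatedness   = mean(relatedness, na.rm = TRUE),
              median.relatedness = median(relatedness, na.rm = TRUE),
              n.mates            = .N),
          by = c("Female.ID", "Year")]
print(res)
```

Each combination of `Female.ID` and `Year` becomes one row of the result, in order of first appearance.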