I have a dataset of groups of genes with each gene having a different score. I am looking to calculate the average gene score and average variation/difference of scores between genes per group.
For example, my data looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I would like to add two columns: one giving the average score per group, and one giving the average difference between scores within each group.
So far, for the average score per group, I am using:
group_average_score <- aggregate(Score ~ Group, df, mean)
However, this returns a separate summary table, and I am struggling to add the result as an additional column in the original data.
For the average variation per group, I've been trying to adapt a similar question (Calculate difference between values by group and matched for time), but I'm struggling to adjust it for my data. I've tried:
library(dplyr)
test <- df %>%
  group_by(Group) %>%
  mutate(Diff = c(NA, diff(Score)))
But I'm not sure this calculates the average variation across all genes' Score values per group. On my real data, the output contains several different values within each group when there should be just one.
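To illustrate what I mean, here is a small sketch using only the Group 1 scores from the example table above (the variable names are just for illustration). As far as I can tell, diff() returns the difference between each row and the row before it, so each row gets its own value instead of one summary value for the whole group:

```r
# Group 1 scores copied from the example table (AQP11, CLNS1A, RSF1)
scores <- c(0.5566507, 0.2811747, 0.5469924)

# Same expression as inside mutate(): NA for the first row,
# then one consecutive difference per remaining row
consecutive <- c(NA, diff(scores))
print(consecutive)
```

So this gives row-by-row differences between neighbouring genes, not an average over every pair of genes in the group.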
Expected output should look something like:
Group Gene Score direct_count secondary_count Average_Score Average_Score_Difference
1 AQP11 0.5566507 4 5 0.46160593 0.183650
1 CLNS1A 0.2811747 0 2 0.46160593 0.183650
1 RSF1 0.5469924 3 6 0.46160593 0.183650
2 CFDP1 0.4186066 1 2 ... ...
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I think the Average_Score_Difference value is correct, but note that I calculated it by hand for the sake of the example (for Group 1, the absolute differences between each pair of genes, summed and divided by 3).
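The hand calculation for Group 1 can be sketched in base R roughly as follows. This is only my reading of "average difference" (the mean absolute difference over all pairs of genes), and mean_pairwise_diff is a made-up helper name, not an existing function:

```r
# Hypothetical helper: mean absolute difference over all pairs of values.
# Returns NA when there are fewer than two values, since no pair exists.
mean_pairwise_diff <- function(x) {
  if (length(x) < 2) return(NA_real_)
  pair_diffs <- combn(x, 2, FUN = function(p) abs(p[1] - p[2]))
  mean(pair_diffs)
}

# Group 1 scores from the example table
group1 <- c(0.5566507, 0.2811747, 0.5469924)
result <- mean_pairwise_diff(group1)
print(result)  # approximately 0.18365
```

With three genes there are three pairs, which is why the hand calculation divides by 3; larger groups would have more pairs.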
Input data:
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))