I would like to compare the mean, sd, and percentage CV of two technical duplicates in R.

Currently my data frame looks like this:

library(tidyverse)

data <- tribble(
  ~rowname, ~Sample, ~Phagocytic_Score,
  1,        1232,    24030,
  2,        1232,    11040,
  3,        4321,    7266,
  4,        4321,    4096,
  5,        5631,    7383,
  6,        5631,    21507
)

Created on 2019-10-22 by the reprex package (v0.3.0)

So I would want to compare the values from rows 1 and 2 together, rows 3 and 4, and so on, ideally storing the result in a new data frame containing just the average score and the stats, if that makes sense.

Sorry, I'm quite new to R, so apologies if this is really straightforward.

Thanks! Mari

DHW
Mari
  • Possible duplicate of [Dplyr function to compute average, n, sd and standard error](https://stackoverflow.com/questions/44266376/dplyr-function-to-compute-average-n-sd-and-standard-error) – DHW Oct 22 '19 at 15:19
  • Possible duplicate of [How to get summary statistics by group](https://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group) – A. Suliman Oct 22 '19 at 15:20

2 Answers

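summarize() can give you exactly this, especially if all the stats you want are computed within groups defined by one variable, i.e. Sample. Note that cv() comes from the raster package and returns the coefficient of variation as a percentage: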

library(raster)
#> Loading required package: sp
library(tidyverse)

data <- tribble(
  ~rowname, ~Sample, ~Phagocytic_Score,
  1,        1232,    24030,
  2,        1232,    11040,
  3,        4321,    7266,
  4,        4321,    4096,
  5,        5631,    7383,
  6,        5631,    21507
)

data %>% 
  group_by(Sample) %>% 
  summarize(
    mean   = mean(Phagocytic_Score),
    sd     = sd(Phagocytic_Score),
    pct_cv = cv(Phagocytic_Score)
  )
#> # A tibble: 3 x 4
#>   Sample  mean    sd pct_cv
#>    <dbl> <dbl> <dbl>  <dbl>
#> 1   1232 17535 9185.   52.4
#> 2   4321  5681 2242.   39.5
#> 3   5631 14445 9987.   69.1
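
As a sanity check on the pct_cv column, here is a minimal base-R sketch that recomputes the figures for Sample 1232 by hand, assuming the usual definition of percentage CV as 100 * sd / mean. The variable names are illustrative only and are not part of the answer above.

```r
# Scores for Sample 1232, copied from the question's data
scores <- c(24030, 11040)

m <- mean(scores)        # arithmetic mean of the two duplicates
s <- sd(scores)          # sample standard deviation (n - 1 denominator)
pct_cv <- 100 * s / m    # coefficient of variation as a percentage

# Rounded to one decimal place for comparison with the tibble output
round(c(mean = m, sd = s, pct_cv = pct_cv), 1)
```

The rounded values should agree with the first row of the summary (17535, about 9185, about 52.4).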

We've got some repeating going on, though, don't we? Each variable is defined as a function call with the same input variable. summarize_at() is more appropriate, then:

data %>% 
  group_by(Sample) %>% 
  summarize_at("Phagocytic_Score", 
               list(mean = mean, sd = sd, cv = cv))
#> # A tibble: 3 x 4
#>   Sample  mean    sd    cv
#>    <dbl> <dbl> <dbl> <dbl>
#> 1   1232 17535 9185.  52.4
#> 2   4321  5681 2242.  39.5
#> 3   5631 14445 9987.  69.1

Ah, but there's still some more room for improvement. Why are we repeating the names of the functions as the names of the variables, since they're the same? Well, mget() will take a single vector of the function names we want, and return a named list of those functions, with the names as those function names:

data %>% 
  group_by(Sample) %>% 
  summarize_at("Phagocytic_Score", 
               mget(c("mean", "sd", "cv"), inherits = TRUE))
#> # A tibble: 3 x 4
#>   Sample  mean    sd    cv
#>    <dbl> <dbl> <dbl> <dbl>
#> 1   1232 17535 9185.  52.4
#> 2   4321  5681 2242.  39.5
#> 3   5631 14445 9987.  69.1

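Note that we need inherits = TRUE: by default mget() searches only the environment it is called from, and mean, sd, and cv are defined in attached packages rather than in that environment.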
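
To see the effect of inherits in isolation, here is a small base-R sketch. It only uses mean and sd so that no extra packages are needed, and it passes envir = globalenv() explicitly, which is an assumption made to keep the lookup environment unambiguous; the variable names are illustrative.

```r
# With inherits = TRUE, mget() walks up through the enclosing environments
# (including attached packages) and returns a named list of the functions.
fns <- mget(c("mean", "sd"), envir = globalenv(), inherits = TRUE)
names(fns)

# With the default inherits = FALSE, only the given environment is searched.
# mean() lives in base, not in the global environment, so the lookup fails.
res <- tryCatch(
  mget(c("mean", "sd"), envir = globalenv()),
  error = function(e) "not found"
)
res
```

This assumes the workspace does not itself define objects called mean or sd.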

Created on 2019-10-22 by the reprex package (v0.3.0)

DHW

If I'm understanding your question, you are looking to summarize your dataframe by grouping based on one of the columns. I assume that in your real data you don't always have exactly two observations of each of your samples.

This approach uses the tidyverse packages; there are other ways to accomplish the same thing.

library(tidyverse)
df %>%   # name of your data frame
    group_by(Sample) %>%   # groups all observations sharing the same Sample value for the subsequent summary
    summarize(Mean = mean(Phagocytic_Score), 
              SD = sd(Phagocytic_Score),
              PercentCV = 100 * SD / Mean # uses the SD and Mean just calculated for each group, as a percentage
              )
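
As one illustration of the "other ways" mentioned above, here is a minimal base-R sketch using tapply(). It is an assumption about how one might do this without the tidyverse, not something from the original answer, and it works for any number of observations per sample.

```r
# Same values as the question's data, as a plain data frame
data <- data.frame(
  Sample = c(1232, 1232, 4321, 4321, 5631, 5631),
  Phagocytic_Score = c(24030, 11040, 7266, 4096, 7383, 21507)
)

# Per-sample mean and standard deviation; tapply() returns arrays
# named by the sorted Sample values
means <- tapply(data$Phagocytic_Score, data$Sample, mean)
sds   <- tapply(data$Phagocytic_Score, data$Sample, sd)

# Assemble the summary, with percentage CV as 100 * sd / mean
result <- data.frame(
  Sample    = as.numeric(names(means)),
  Mean      = as.vector(means),
  SD        = as.vector(sds),
  PercentCV = as.vector(100 * sds / means)
)

result
```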
Brian Fisher