How to create a R function to normalize data to sum up to 100%

Question

I have average percent cover for each functional group according to Year, Month, Site, and Treatment (see photo). These functional group average values do not sum up to 100% for each treatment group (sorted by year, month, and site). I would like to normalize it to 100%. I was able to create an equation in Excel (as shown in the top of the photo); but it is labor intensive. I am not sure how to create a R function that would automatically do it. I tried to start writing it (below) but I know the sum(x) part is inaccurate. I am not sure how to sum all of the functional group's percent cover for each treatment sorted by site, month and year. Perhaps using the aggregate function would help? Any help would be greatly appreciated!

normalize <- function(x, na.rm = TRUE) x*100/sum(x)

Spreadsheet of clipped data

Here's the reproducible example using the dput output.

structure(
 list(
  Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2017L, 2017L, 2018L, 2018L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L),
  Month = structure(
   c(2L, 1L, 2L, 1L, 3L, 1L, 3L, 3L, 3L, 4L, 5L, 1L, 2L, 5L, 1L, 2L, 1L, 2L, 3L, 5L, 1L, 2L, 3L, 1L, 2L),
   .Label = c("1", "2", "3", "10", "11"),
   class = "factor"
   ),
  Site = structure(
   c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L),
   .Label = c("RR", "TMB"),
   class = "factor"
   ),
  Treatment = structure(
   c(6L, 7L, 7L, 5L, 5L, 1L, 1L, 4L, 2L, 3L, 4L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L),
   .Label = c("HLU", "U", "HU", "LU", "HL", "B", "H", "L", "P"),
   class = "factor"
   ), 
  Spp.Name = structure(
   c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L),
   .Label = c("Anemones", "Bare Rock", "Barnacles", "Biofilm", "Bleached Coarsely Branched", "Bleached Crustose", "Bleached Jointed Calcareous", "Bleached Sheet", "Brown Coarsely Branched", "Brown Crustose", "Brown Filamentous", "Brown Sheet", "Green Crustose", "Green Filamentous", "Green Sheet", "Mussels", "Red Coarsely Branched", "Red Crustose", "Red Filamentous", "Red Jointed Calcareous", "Red Sheet"),
   class = "factor"
   ), 
  Functional.Group = structure(
   c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
   .Label = c("Biofilm", "Bleached Coarsely Branched", "Bleached Crustose", "Bleached Jointed Calcareous", "Bleached Sheet", "Coarsely Branched", "Crustose", "Filamentous", "Invertebrates", "Jointed Calcareous", "Rock", "Sheet"),
   class = "factor"
   ), 
  Cover.Mean = c(12, 19, 2, 2, 6.66666666666667, 3, 13, 2, 1, 1, 3, 28, 9, 48.5, 5, 13, 39, 24, 5.66666666666667, 66.25, 6.66666666666667, 7, 4, 57.25, 41.25)
 ),
 row.names = c(NA, 25L),
 class = "data.frame"
)

Welcome to SO. Your problem is likely solved with some `group_by` and `mutate` operations from the `dplyr` package. That being said, help us help you, and create a reproducible example (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). To start with post the results of `dput(YOURDATA)` so we have something to work with. — Ben G, Dec 11 '19 at 20:41
What makes you think the sum part is inaccurate? Looks good to me (except that you’re not forwarding `na.rm`). That said, I *strongly* recommend against treating percentages as parts of 100 in code: this is mathematically *wrong*. Percentages are *fractions*, i.e. they sum to 1 — not to 100 (i.e. 100% = 1 ≠ 100)! Writing “x%” instead of “0.x” is simply a matter of representation, not of implementation. In other words, store fractions, display percentages via an appropriate formatting function. — Konrad Rudolph, Dec 11 '19 at 20:49
@BenG Thank you for the link and I have added a reproducible example with the 'dput' output. — vita_aquaticus, Dec 11 '19 at 23:10
@KonradRudolph. The sum part did not address the specific groupings (i.e. grouping of same Treatment within the same Year, Month and Site). Thanks for the tip about treating percentages. Will keep that in mind. — vita_aquaticus, Dec 11 '19 at 23:14

score 2 · Accepted Answer · answered Dec 12 '19 at 00:56

Such operations when you want to perform calculations for every unique value in column are called as grouped operations. There are various functions which would help you achieve what you want.

In base R, you can use ave

df$Std.Cover <- with(df,  Cover.Mean/ave(Cover.Mean, Year, Month, Site, Treatment, 
                FUN = sum) * 100)

So here, the first value Cover.Mean in ave is the variable on which we want to apply the function sum but it is done for each Year, Month, Site and Treatment. We divide the sum of each group by Cover.Mean to get ratio and multiply it by 100 to get percentage.

We can also use solutions from different packages like dplyr

library(dplyr)

df %>%
  group_by(Year, Month, Site, Treatment) %>%
  mutate(Std.Cover = Cover.Mean/sum(Cover.Mean) * 100)

Or data.table

library(data.table)
setDT(df)[, Std.Cover := Cover.Mean/sum(Cover.Mean) * 100, 
                        .(Year, Month, Site, Treatment)]

Brunox13 · Answer 2 · 2019-12-12T00:01:35.280

0

Assigning your reproducible example to the df variable, you should be able to do what you are trying to do this way:

for (i in 1:nrow(df)) {
  df$Std.Cover.Mean[i] <- df$Cover.Mean[i] * 100 / sum(
    df$Cover.Mean[
      which(
        df$Year == df$Year[i] & df$Month == df$Month[i] & df$Site == df$Site[i] & df$Treatment == df$Treatment[i]
        )
      ]
    )
  }

Essentially, the sum function adds up all the Cover.Mean values where Year, Month, Site, and Treatment are the same as those of the row in question.

edited Dec 12 '19 at 00:01

answered Dec 11 '19 at 20:47

Brunox13

775
1
7
21

1

Thank you @Brunox13 for the steps. The code you listed will add up all of the Cover.Mean values where Treatment is the same. I want all of the Cover.Mean values to be added where Treatment is the same within the same year, month and site. Does it make sense? – vita_aquaticus Dec 11 '19 at 23:11
I see, I originally misunderstood. I edited my answer to include all of these requirements! Essentially, you'd like to use the `which` function, which will contain all of the conditions for the `sum`. – Brunox13 Dec 12 '19 at 00:02

How to create a R function to normalize data to sum up to 100%

2 Answers2