I have a data set of 20 years of measurements (14600x6) and need to get the geometric mean of $tu per $name and $trophic. Originally, I had my df split into three dfs and did the following:

Old code, based on the split dfs:

trophic_pp<- df_pp %>% select(sites, name, tu_pp)%>%
  group_by(name) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = name, values_from = tu_pp) %>%
  replace(is.na(.), 0)%>%
  select(-row)
trophic_dc<- ...... same
trophic_pt<- ...... same

then

trophic_pp<- trophic_pp%>%
  mutate(sum_pp = rowSums(across(where(is.numeric))))
trophic_dc<- ...... same
trophic_pt<- ...... same

then

trophic_pp_sites <- trophic_pp %>% select(sites, sum_pp) %>%
  group_by(sites) %>%
  summarise(gmean = gmean(sum_pp)) %>%
  add_column(trophic = "pp", .before = "gmean")
trophic_dc<- ...... same
trophic_pt<- ...... same

Then I merged everything with Reduce to finally plot:

all_trophic <- Reduce(function(x, y) merge(x, y, all=TRUE), list(trophic_pp,
                                                                 trophic_dc,
                                                                 trophic_pt)) %>%
  mutate(type = case_when(
    startsWith(sites, "R") ~ "river",
    startsWith(sites, "T") ~ "tributary"
    ))

As you can see, the code is long and repetitive.

I rearranged my data into a single df instead of three, and its str now looks like this:

tibble [14,100 x 6] (S3: tbl_df/tbl/data.frame)
     $ name             : Factor w/ 6 levels "Al","As","Cu",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ cas              : chr [1:14100] "7429-90-5" "7429-90-5" "7429-90-5" "7429-90-5" ...
     $ sites            : chr [1:14100] "R1" "R1" "R1" "R5" ...
     $ conc             : num [1:14100] 12.12 12.12 12.12 2.06 2.06 ...
     $ trophic          : chr [1:14100] "tu_pp" "tu_pc" "tu_sc" "tu_pp" ...
     $ tu               : num [1:14100] 12.41 4.83 7.22 2.11 0.82 ...

Each $name has its own $cas, there are 9 $sites, and each $tu is calculated from $conc for three different $trophic levels. Therefore, $tu is the only variable that changes in every row.

I am struggling to calculate the geometric mean. I tried the following:

Define the geometric mean function:

gmean <- function(x, na.rm = TRUE) {
  # geometric mean: exponentiate the mean of the logs (assumes x > 0)
  exp(mean(log(x), na.rm = na.rm))
}

Then I created a list based on $trophic:

trophic_list <- split(df, df$trophic)

and tried to run lapply over the list:

for (i in seq_along(trophic_list)) {
  
  trophic_list[[i]] <- within(trophic_list[[i]], {

  gmean <- lapply(trophic_list[tu], FUN: gmean
    
  })
}

Sorry for the long explanation; I'd appreciate your help.

Pedr Nton
  • Is a tidyverse solution an option? – Reeza Feb 24 '21 at 20:37
  • Any advice to reduce the repetition in my code is welcome. – Pedr Nton Feb 24 '21 at 20:48
  • Maybe I'm missing something since you didn't post sample data, but it seems like you could just add more variables to your `group_by` statements, since you're grouping by more than one variable, and then use a transpose/pivot_wider to rename things if you really want them set up that way. – Reeza Feb 24 '21 at 20:50
  • Your original code was `trophic_pp<- df_pp %>% select(sites, name, tu_pp)%>% group_by(name) %>%`. Can't you just use the merged dataframe and do something like `trophic_merged<- df_merged %>% select(sites, name, tu_pp)%>% group_by(name,trophic) %>%`? Adding the `trophic` column to the grouping statement is like running the code 3 ways. – Adam Sampson Feb 24 '21 at 21:04
  • Never splitting seems to make more sense: pivot_wider and grab the data from trophic to rename the measures. That may or may not even be necessary, depending on what happens next. – Reeza Feb 24 '21 at 21:07
  • In the end, I did not split the data and grouped by the new variables instead. It was easy and straightforward. Thanks for your advice. I sometimes struggle with how best to organise the data for later downstream analysis. Your advice was very helpful! – Pedr Nton Feb 26 '21 at 08:01

1 Answer

If you can use the tidyverse, this is one way to accomplish what you want:

library(tidyverse)

#use cars to play with
cars <- mpg

#function for geometric mean
#from here https://stackoverflow.com/questions/2602583/geometric-mean-is-there-a-built-in
geo_mean = function(x, na.rm=TRUE){
    exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}

#calculate geometric mean per manufacturer and year
#in your case group by trophic/name
geo_mean_summary <- cars %>%
    group_by(manufacturer, year) %>%
    summarize(geoMean_City = geo_mean(cty),
              geoMean_HWY = geo_mean(hwy))
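
Applied to a df shaped like the one in the question, the same pattern might look like the sketch below. The toy values are made up for illustration, only the columns needed here are included, and `tu_summary` and `gmean_tu` are hypothetical names. It assumes every `tu` value is strictly positive.

```r
library(dplyr)

# Toy data with the question's column names (values invented for illustration)
df <- data.frame(
  name    = factor(rep(c("Al", "As"), each = 6)),
  sites   = rep(c("R1", "R1", "R1", "T2", "T2", "T2"), times = 2),
  trophic = rep(c("tu_pp", "tu_pc", "tu_sc"), times = 4),
  tu      = c(12.41, 4.83, 7.22, 2.11, 0.82, 1.23,
              3.10, 1.05, 2.40, 0.95, 0.33, 0.71)
)

# Plain geometric mean; assumes all values are strictly positive
gmean <- function(x, na.rm = TRUE) {
  exp(mean(log(x), na.rm = na.rm))
}

# One row per name/trophic combination, no splitting into separate dfs
tu_summary <- df %>%
  group_by(name, trophic) %>%
  summarise(gmean_tu = gmean(tu), .groups = "drop")
```

If a per-site value is needed as well, adding `sites` to `group_by()` should give one row per name/trophic/site combination.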

Note the comments on the linked post about how to handle negative values, zeros, or missing values, if applicable to your situation.
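
To make that caveat concrete, here is a small sketch of how `geo_mean` as written treats a zero: the zero is dropped from the sum of logs but still counted in `length(x)`, which pulls the result down. `geo_mean_pos` is a hypothetical variant, not part of the linked post, that drops non-positive and missing values from both the sum and the count. Which behaviour is appropriate depends on your data.

```r
# Version from the answer: non-positive values are excluded from the sum,
# but length(x) still counts them in the denominator
geo_mean <- function(x, na.rm = TRUE) {
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

# Hypothetical alternative: drop NA and non-positive values entirely
geo_mean_pos <- function(x) {
  x <- x[!is.na(x) & x > 0]
  exp(mean(log(x)))
}

x <- c(1, 4, 0, 16)
geo_mean(x)      # 64^(1/4), about 2.83: the zero still counts towards n
geo_mean_pos(x)  # 64^(1/3) = 4: the zero is ignored completely
```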

Reeza