User defined function based on multiple columns grouped by category

Question

This may be basic, but I've been trying to figure it out for days and haven't found an answer.

I am trying to calculate a new quantity based on two columns 'concentration' and 'area' grouped by 'catchment'. I've written a function to calculate the difference in concentration for each row and the row with the largest area normalized by proportion of area in that catchment, but it won't work with dplyr or aggregate (. It works fine with by, but then returns a list.

Ideally, I want to add a column onto the dataframe or replace the concentration column altogether. Here is the dataframe 'lev':

  area catchment concentration
1    1       Yup       2.00000
2   10       Yup      40.50000
3   25       Yup      50.82031
4   35       Yup      50.00000
5    1      Nope       1.00000
6   10      Nope       5.00000
7   25      Nope      40.08333
8   35      Nope      38.00000

Here is the function:

lever <- function(data=lev, x=data[,"concentration"], y=data[,"area"]){
N= which.max(y) 
L = (x - x[N]) * y/max(y)
return(L)}

And here is the desired result:

   area catchment concentration   leverage
1    1       Yup       2.00000 -1.3714286
2   10       Yup      40.50000 -2.7142857
3   25       Yup      50.82031  0.5859375
4   35       Yup      50.00000  0.0000000
5    1      Nope       1.00000 -1.0571429
6   10      Nope       5.00000 -9.4285714
7   25      Nope      40.08333  1.4880952
8   35      Nope      38.00000  0.0000000

Using by, I can get two lists with the results for each catchment:

by(lev, lev$catchment, lever)

but I want to use the function on multiple columns categorized by several factors (e.g. date in addition to catchment) and I get

'incorrect number of dimensions'

errors with doBy and dplyr.

We can give better answers if you provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5965451#5965451). — Paul Rougieux, Feb 22 '17 at 15:25
Thanks for editing it to make it reproducible. I'll do better next time :) — benjabiker, Feb 22 '17 at 15:48

score 1 · Answer 1 · edited Feb 22 '17 at 15:40

1

We can use tidyverse

library(tidyverse)
df1 %>% 
  group_by(catchment) %>%
  mutate(leverage = (concentration- concentration[which.max(area)]) * area/max(area))

Based on the description, if there are multiple columns as grouping variable, place those in the group_by, and the calculation can also be applied to multiple columns with mutate_each

edited Feb 22 '17 at 15:40

Paul Rougieux

10,289
4
68
110

answered Feb 22 '17 at 15:32

akrun

874,273
37
540
662

Paul Rougieux · Accepted Answer · 2017-02-22T16:39:57.493

Loading your data:

lev <- read.table(text = "area catchment concentration
    1       Yup       2.00000
   10       Yup      40.50000
   25       Yup      50.82031
   35       Yup      50.00000
    1      Nope       1.00000
   10      Nope       5.00000
   25      Nope      40.08333
   35      Nope      38.00000", 
   header=TRUE)

Grouped by catchment

library(dplyr)
lev %>% 
    group_by(catchment) %>% 
    mutate(N = which.max(area),
           L = (concentration - concentration[N]) * area/max(area))

# 
#    area catchment concentration     N          L
#   <int>    <fctr>         <dbl> <int>      <dbl>
# 1     1       Yup       2.00000     4 -1.3714286
# 2    10       Yup      40.50000     4 -2.7142857
# 3    25       Yup      50.82031     4  0.5859357
# 4    35       Yup      50.00000     4  0.0000000
# 5     1      Nope       1.00000     4 -1.0571429
# 6    10      Nope       5.00000     4 -9.4285714
# 7    25      Nope      40.08333     4  1.4880929
# 8    35      Nope      38.00000     4  0.0000000

Using your function

I modify your function so that it returns a data frame.

lever2 <- function(data, 
                   x = data[,"concentration"][[1]], 
                   y = data[,"area"][[1]]){
    # Use [[1]] to extract the vector only
    N <- which.max(y)
    L <- (x - x[N]) * y/max(y)
    # Put L back into the data frame 
    # so that we keep the concentration and area in the result
    data$L <- L
    return(data)
    }

The funtion can then be used with dplyr::group_by %>% do

lev %>% 
    group_by(catchment) %>% 
    do( lever2(.))

Yes I was writing at the same time on my laptop, but I'm slower than you. I though about adding an example using the OP's function `lever` and the `group_by` %>% `do` mechanism, but somehow this returns `(list) object cannot be coerced to type 'double'` I still have to figure out how to make this one work. — Paul Rougieux, Feb 22 '17 at 15:42
Works perfect! If I have multiple columns (e.g. concentration1, concentration2), how could I add the L to the dataframe for each one? — benjabiker, Feb 22 '17 at 15:50
Edit the `mutate` instruction `L = (concentration2 - concentration2[N]) * area/max(area)`. But if you have a wide data structure, you might consider reshaping the data frame to a long format with [tidyr::gather](ftp://cran.r-project.org/pub/R/web/packages/tidyr/vignettes/tidy-data.html) before performing the `mutate`. — Paul Rougieux, Feb 22 '17 at 15:57
@benjabiker I added an example using a modified version of your function. — Paul Rougieux, Feb 22 '17 at 16:40

score 1 · Answer 3 · answered Feb 22 '17 at 15:39

You can also use data.table to calculate this value:

library(data.table)
# convert to data.table
setDT(df)

df[, leverage := (concentration - concentration[which.max(area)]) * (area / max(area)),
   by=catchment]
df
   area catchment concentration   leverage
1:    1       Yup       2.00000 -1.3714286
2:   10       Yup      40.50000 -2.7142857
3:   25       Yup      50.82031  0.5859357
4:   35       Yup      50.00000  0.0000000
5:    1      Nope       1.00000 -1.0571429
6:   10      Nope       5.00000 -9.4285714
7:   25      Nope      40.08333  1.4880929
8:   35      Nope      38.00000  0.0000000

data

df <-
structure(list(area = c(1L, 10L, 25L, 35L, 1L, 10L, 25L, 35L), 
    catchment = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Nope", 
    "Yup"), class = "factor"), concentration = c(2, 40.5, 50.82031, 
    50, 1, 5, 40.08333, 38)), .Names = c("area", "catchment", 
"concentration"), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8"))

User defined function based on multiple columns grouped by category

3 Answers3

Using your function