9

I have a situation where my data frame contains the results of image analysis where the columns are the proportion of a particular class present in the image, such that an example dataframe class_df would look like:

id    A    B    C    D    E    F
 1 0.20 0.30 0.10 0.15 0.25 0.00 
 2 0.05 0.10 0.05 0.30 0.10 0.40
 3 0.10 0.10 0.10 0.20 0.20 0.30

Each of these classes belongs to a functional group and I want to create new columns where the proportions of each functional group are calculated from the classes. An example mapping class_fg

class         fg
    A          Z
    B          Z
    C          Z
    D          Y
    E          Y
    F          X

and the desired result would be (line added to show the desired new columns):

id    A    B    C    D    E    F |    X    Y    Z
 1 0.20 0.30 0.10 0.15 0.25 0.00 | 0.00 0.40 0.60
 2 0.05 0.10 0.05 0.30 0.10 0.40 | 0.40 0.40 0.20
 3 0.10 0.10 0.10 0.20 0.20 0.30 | 0.30 0.40 0.30

And I can do it one functional group at a time using

first_fg <- class_fg %>%
  filter(fg == "Z") %>%
  select(class) %>%
  unlist()

class_df <- class_df %>%
  mutate(Z = rowSums(select(., one_of(first_fg))))

Surely there is a better way to do this where I can calculate the row sum for each functional group without having to just repeat this code for each group? Maybe using purrr?

Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
Syzorr
  • 587
  • 1
  • 5
  • 17
  • 1
    Yes that's called **aggregation** then summarization. Do `class_fg %>% group_by(fg) %>% summarize(...your summary code goes here...)` – smci May 22 '18 at 05:37
  • Sorry @Ronak, updated correctly. "label" should have been "class" – Syzorr May 22 '18 at 05:43
  • @smci - I don't see how that would allow me to create a summary for the class_df which is what I'm actually wanting to summarize? – Syzorr May 22 '18 at 05:45
  • Your code was confusing because you never named your df. (Is it called `class_fg` or `class_df`? What is `class_df`?) Either way, the solution you want is `whatever_your_df_is_actually_called %>% group_by(fg) %>% summarize(...your summary code goes here...)` – smci May 22 '18 at 05:48
  • Sounds like it might be useful to transform `class_df` into long form, then join in the `fg` based on `class`, then aggregate and transform back to wide format. It's hard to say without actually being able to see the data though; could you provide a runnable example? – Mikko Marttila May 22 '18 at 05:49
  • 1
    I've updated to make the examples clearer. I'm working with a very large data set where I'm unsure if I can share (and using it as an example would be too large here), so hopefully the improved examples help. – Syzorr May 22 '18 at 05:57

5 Answers5

7

We could split the 'class_df' by 'class', loop through the list elements with map, select the columns of 'class_df' and get the rowSums

library(tidyverse)
class_fg %>%
    split(.$fg) %>% 
    map_df(~ class_df %>%
                select(one_of(.x$class)) %>% 
                rowSums) %>%
    bind_cols(class_df, .)
#  id    A   B    C    D    E   F   X   Y   Z
#1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3

Or do a group by nesting, and then do the rowSums by mapping over the list elements

class_fg %>% 
   group_by(fg) %>%
   nest %>%
   mutate(out = map(data, ~  class_df %>%
                               select(one_of(.x$class)) %>% 
                               rowSums)) %>% 
   select(-data)  %>%
   unnest %>% 
   unstack(., out ~ fg) %>% 
   bind_cols(class_df, .)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thank you, I had the feeling that it was possible with a single statement like that. I'm still coming to grips with how to nest effectively without immediately resorting to creating a long format table. – Syzorr May 22 '18 at 06:13
6

Always it is easier to work on data in long format. Hence, change class_df to long format using tidyr:gather and join with class_fg. Perform analysis in long format on your data. Finally, spread in wide-format to match expected result.

library(tidyverse)

class_df %>% gather(key, value, -id) %>% 
  inner_join(class_fg, by=c("key" = "class")) %>%
  group_by(id, fg) %>%
  summarise(value = sum(value)) %>%
  spread(fg, value) %>%
  inner_join(class_df, by="id") %>% as.data.frame()

#   id   X   Y   Z    A   B    C    D    E   F
# 1  1 0.0 0.4 0.6 0.20 0.3 0.10 0.15 0.25 0.0
# 2  2 0.4 0.4 0.2 0.05 0.1 0.05 0.30 0.10 0.4
# 3  3 0.3 0.4 0.3 0.10 0.1 0.10 0.20 0.20 0.3

Data:

class_fg <- read.table(text = 
"class         fg
                 A          Z
                 B          Z
                 C          Z
                 D          Y
                 E          Y
                 F          X",
header = TRUE, stringsAsFactors = FALSE)

class_df  <- read.table(text = 
"id    A    B    C    D    E    F
1 0.20 0.30 0.10 0.15 0.25 0.00 
2 0.05 0.10 0.05 0.30 0.10 0.40
3 0.10 0.10 0.10 0.20 0.20 0.30",
header = TRUE, stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
  • Want to say thank you because adding all these answers to a small example library so I can be more flexible in future. – Syzorr May 22 '18 at 08:16
5

Yet another option, along with the already contributed working solutions, would be to use quasiquotation with the rlang package to build expressions to calculate the sums in each group.

library(tidyverse)

First, define a helper function for doing an elementwise sum of vectors:

psum <- function(...) reduce(list(...), `+`)

Extracting the groupings into a list from class_fg we can then construct a list of expressions to calculate the sum in each group:

sum_exprs <- with(class_fg, split(class, fg)) %>% 
  map(~ rlang::expr(psum(!!!rlang::syms(.x))))

sum_exprs
#> $X
#> psum(F)
#> 
#> $Y
#> psum(D, E)
#> 
#> $Z
#> psum(A, B, C)

With the list of expressions ready, we can just "bang-bang-bang" (!!!) them into the data with mutate:

class_df %>%
  mutate(!!!sum_exprs)
#>   id    A   B    C    D    E   F   X   Y   Z
#> 1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#> 2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#> 3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3

(I used the code provided by @MKR in his answer to create the data).

Created on 2018-05-22 by the reprex package (v0.2.0).

Mikko Marttila
  • 10,972
  • 18
  • 31
  • Want to say thank you because adding all these answers to a small example library so I can be more flexible in future. – Syzorr May 22 '18 at 08:16
1

My usual approach is to stick to base R as long as the data sets don't get too large. In your case, a base R solution would be:

class_df=as.data.frame(
  c(class_df,
    lapply(split(class_fg,class_fg$fg),
           function(x) rowSums(class_df[,x$class,drop=FALSE]))))
class_df
#  id    A   B    C    D    E   F   X   Y   Z
#1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3

If the data sets get too large, I use data.table. A data.table solution for your problem:

library(data.table)

class_dt=data.table(class_df)
grps=split(class_fg,class_fg$fg)

for (g in grps) class_dt[,c(g$fg[1]):=rowSums(.SD),.SDcols=g$class,]
class_dt
#   id    A   B    C    D    E   F   X   Y   Z
#1:  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2:  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3:  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3
cryo111
  • 4,444
  • 1
  • 15
  • 37
0

Another tidyverse solution using rowSums on column subsets :

library(tidyverse)
class_fg %>%
  group_by(fg) %>% 
  summarize(list(rowSums(class_df[class]))) %>%
  spread(1,2) %>%
  unnest() %>%
  bind_cols(class_df, .)

#>   id    A   B    C    D    E   F   X   Y   Z
#> 1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#> 2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#> 3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3

Or for the glory of code golf :

x <- with(class_fg, tapply(class, fg, reformulate))
mutate(class_df, !!!map(x, ~as.list(.)[[2]]))
#>   id    A   B    C    D    E   F   X   Y   Z
#> 1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#> 2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#> 3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167