0

How can I use dplyr/tidyr to covert a data frame like this:

df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
             S1=rep(c("a","b"),length.out=10),S1PR=rep(c(3,7),length.out=10),
             S2=rep(c("c","d"),length.out=10),S2PR=rep(c(7,3),length.out=10),
             Relsize=c(.4,.6,.4,.6,.2,.2,.6,.2,.2,.6))


   obj S1 S1PR S2 S2PR Relsize
1    1  a    3  c    7     0.4
2    1  b    7  d    3     0.6
3    2  a    3  c    7     0.4
4    2  b    7  d    3     0.6
5    3  a    3  c    7     0.2
6    3  b    7  d    3     0.2
7    3  a    3  c    7     0.6
8    4  b    7  d    3     0.2
9    4  a    3  c    7     0.2
10   4  b    7  d    3     0.6

And turn it in to one like this:

obj  a    b    c    d
 1  0.12 0.42 0.28 0.18
 2  0.12 0.42 0.28 0.18
 3  0.24 0.14 0.56 0.06

The values in the output data frame are based on PR/10*Relsize. EDIT: where there is more than one entry for the same species they should be summed I've been trying to do this with some combination of spread and gather but I'm not sure how to do it.

see24
  • 1,097
  • 10
  • 21
  • 1
    Relsize isn't in your example df – Calum You Mar 08 '18 at 17:47
  • Moreover the expected result should include row for `obj = 4`. – MKR Mar 08 '18 at 19:08
  • Further, it isn't clear how to combine the input into the output. For example, the output has only one value for `a` for `obj = 3`, but there are two rows with `obj = 3` and `S1 = a`. SImilarly for 4. What do you propose to do in this situation? – Calum You Mar 08 '18 at 19:57

4 Answers4

1

Using base R:

 df <- data.frame(stringsAsFactors = F,obj=c(1,1,2,2,3,3,3,4,4,4),
                                 S1=rep(c("a","b"),length.out=10),S1PR=rep(c(3,7),length.out=10),
                                   S2=rep(c("c","d"),length.out=10),S2PR=rep(c(7,3),length.out=10))
 df$Relsize=c(0.4,0.6,.4,.6,.2,.2,.6,.2,.2,.6)
 #Create the next two columns required:. I will call them `oo and pp`
 df1=transform(df,oo=S1PR*Relsize/10,pp=S2PR*Relsize/10)

Group by the columns obj,s1 and s2 and perform a sum on oo and pp we use aggregate

m=aggregate(.~obj+S1+S2,df1,sum)#aggregate(cbind(oo,pp)~obj+S1+S2,df1,sum)

Now rearrange the resultant dataframe as shown I e, the contents of columns S1,S2 should be the new column names while the contents of oo,pp should be spread within these columns. We use unstack

unstack(m,cbind(oo,pp)~cbind(S1,S2))
     a    b    c    d
1 0.12 0.42 0.28 0.18
2 0.12 0.42 0.28 0.18
3 0.24 0.14 0.56 0.06
4 0.06 0.56 0.14 0.24
Onyambu
  • 67,392
  • 3
  • 24
  • 53
1

Another option could be using dcast from reshape2 to arrive :

library(tidyverse)
library(reshape2)

df_mod <- df %>%
  mutate(S1PR = (S1PR/10)*Relsize, 
         S2PR = (S2PR/10)*Relsize) %>%
  select(-Relsize)

  bind_rows(x = select(df_mod,obj, S := S1, PR := S1PR), 
            y = select(df_mod, obj, S := S2, PR :=S2PR)) %>%
    dcast(obj ~ S, sum)


#  obj    a    b    c    d
#1   1 0.12 0.42 0.28 0.18
#2   2 0.12 0.42 0.28 0.18
#3   3 0.24 0.14 0.56 0.06
#4   4 0.06 0.56 0.14 0.24

Data

df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
                 S1=rep(c("a","b"),length.out=10),S1PR=rep(c(3,7),length.out=10),
                 S2=rep(c("c","d"),length.out=10),S2PR=rep(c(7,3),length.out=10),
                 Relsize=c(.4, .6, .4, .6, .2, .2, .6, .2, .2, .6),
                 stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
1

Using dplyr and tidyr but the same as MKR's solution.

OP didn't specify if summing is correct.

Perhaps the bind_rows could be more elegant but it seems all the suggestions so far do something similar.

df %>% 
    as_data_frame %>% 
    mutate(Relsize=c(0.4,0.6,.4,.6,.2,.2,.6,.2,.2,.6)) %>% 
    mutate(S1PR = (S1PR/10)*Relsize, S2PR = (S2PR/10)*Relsize) %>%  
    {bind_rows(select(., obj, S = S1, PR = S1PR),
               select(., obj, S = S2, PR = S2PR)
               )
    } %>% 
    group_by(obj, S) %>% 
    summarise(PR=sum(PR)) %>% 
    spread(S, PR)
Jan Stanstrup
  • 1,152
  • 11
  • 28
0

I figured out a way that avoids using the bind_rows

  df %>% 
    unite(S1_S1PR, contains("S1")) %>% 
    unite(S2_S2PR, contains("S2")) %>% 
    gather(x, Species, S1_S1PR, S2_S2PR) %>% 
    separate(Species, into = c("Species","PR"), convert = T) %>% 
    group_by(obj, Species) %>% 
    mutate(PR = sum(PR/10*Relsize)) %>% 
    select(-x, -Relsize) %>% distinct() %>% 
    spread(Species, PR)

It is a bit more readable but takes more steps. I find that I see data like this often in ecology where each row is a location and there is a set of columns like SX and SXPR that are repeated for the first, second, and third most common species at that location. Being able to get all the matching columns using dplyr's nice select helpers could be really useful with lots of columns

see24
  • 1,097
  • 10
  • 21