Aggregating Data in R dataframe

Question

i have this data frame:

   ers1 task
 t1    3     t1
 t2    3     t2
 t3    3     t3
 t4    4     t4
 t5    3     t5
 t6    4     t6
 t7    4     t7
 t8    3     t8

The data describes the task distribution (8 different tasks) for a group of employees. i would like to aggregate the tasks for a given employee, in order to get something like this:

 ers task1 task2 task3 task4 task5
 3   t1    t2     t3    t5    t8
 4   t4    t6     t7

Any suggestion? Thanks

`tidyr::spread` (https://tidyr.tidyverse.org/). I don't think of this as aggregating, more as reshaping. The former typically results in a loss of data when you summarize 2 or more data to a single (with `mean`, `median`, `head`, `min`, `sample`, etc); the latter starts with `prod(dims(dat))` and ends with `prod(dims(wdat))` being at least that value (occasionally more, depending on missingness). — r2evans, Sep 05 '18 at 01:25
This might be helpful place to begin with https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format — Ronak Shah, Sep 05 '18 at 01:27

score 0 · Answer 1 · answered Sep 05 '18 at 01:45

Please see the following solution in base R for one creative approach.

d1 <- data.frame(ers1 = c(3,3,3,4,3,4,4,3), task = paste0("t",1:8)) # raw data
d2 <- table(d1) # use table to do the reshaping work here
l1 <- apply(d2, 1, function(x) colnames(d2)[index(x)*x]) # use the 0s and 1s to fill tasks
d3 <- t(sapply(l1, '[', seq(max(sapply(l1, length))))) # combine lists of varying lengths
colnames(d3) <- paste0("t",1:ncol(d3)) # create colnames
d3[is.na(d3)] <- "" # change NAs to blanks as desired
d3
#  t1   t2   t3   t4   t5  
#3 "t1" "t2" "t3" "t5" "t8"
#4 "t4" "t6" "t7" ""   ""

Personally, I would stop after getting l1 (list 1) because it seems more useful for programming/applying functions.

score 0 · Answer 2 · answered Sep 05 '18 at 02:06

0

thanks for letting me know the spread command from tidyverse library.

I solved the problem in one line

spread(ers1, key=task, value = task)

answered Sep 05 '18 at 02:06

Martin

41
7

Except that doesn't give you the format you requested. You might have to get a bit fancier like `dat %>% group_by(ers1) %>% mutate(time=paste0("task",row_number())) %>% spread(key=time, value=task)` or in base R - `reshape(transform(dat, time=ave(ers1,ers1,FUN=seq_along)), idvar="ers1", direction="wide", sep="")` – thelatemail Sep 05 '18 at 02:11
Thanks! you´re right, your solution is more precise since it only generates 5 columns instead of 8. – Martin Sep 05 '18 at 02:21

Chriss Paul · Accepted Answer · 2018-09-05T23:44:13.787

Also an alternative using split from data.table

library(data.table)
split(setDT(d1),by=c("ers1"),keep.by = FALSE)

$`3`
   task
1:   t1
2:   t2
3:   t3
4:   t5
5:   t8

$`4`
   task
1:   t4
2:   t6
3:   t7

Then, to obtain your required format:

sapply( split(setDT(d1),by="ers1",keep.by = FALSE),'[',1:5)
$`3.task`
[1] t1 t2 t3 t5 t8
Levels: t1 t2 t3 t4 t5 t6 t7 t8

$`4.task`
[1] t4   t6   t7   <NA> <NA>
Levels: t1 t2 t3 t4 t5 t6 t7 t8

Aggregating Data in R dataframe

3 Answers3