Concatenating cells in a data frame using aggregate.data.frame

Question

I have a data frame like the following:

df <- data.frame(bee.num=c(1,1,1,2,2,3,3), plant=c("d","d","w","d","d","w","d")) 
df$visits = list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4,5:11)
df

    bee.num plant                        visits
1       1     d                          1, 2, 3
2       1     d                 4, 5, 6, 7, 8, 9
3       1     w                           10, 11
4       2     d    1, 2, 3, 4, 5, 6, 7, 8, 9, 10
5       2     d                           11, 12
6       3     w                       1, 2, 3, 4
7       3     d            5, 6, 7, 8, 9, 10, 11

I would like to aggregate visits by bee.num and plant with a function that concatenates the values for visit based on matching bee.num and plant values, like the one below

     bee.num plant                        visits
1       1     d                1, 2, 3, 4, 5, 6, 7, 8, 9
2       1     w                                   10, 11
3       2     d    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
4       3     w                               1, 2, 3, 4
5       3     d                    5, 6, 7, 8, 9, 10, 11

I've tried

aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)

and

aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=unlist)

but I always get an "arguments imply differing number of rows" error. Any help would be greatly appreciated. Thanks in advance.

I suggest you provide us with a reproducible example. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Dec 15 '13 at 00:02
Thanks, just added some code for reproducing the first data frame - hope that addresses your suggestion. — user2849910, Dec 15 '13 at 00:22

Matthew Lundberg · Answer 1 · 2013-12-15T03:40:32.340

The function works as expected if you pass a data frame containing the list as a column, rather than pass the list itself.

x <- aggregate.data.frame(df['visits'], list(df$bee.num, df$plant) , FUN=c)
names(x) <- c('bee.num', 'plant', 'visits')
x
##   bee.num plant                                visits
## 1       1     d             1, 2, 3, 4, 5, 6, 7, 8, 9
## 2       2     d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
## 3       3     d                 5, 6, 7, 8, 9, 10, 11
## 4       1     w                                10, 11
## 5       3     w                            1, 2, 3, 4

Note:

> class(df$visits)
[1] "list"
> class(df['visits'])
[1] "data.frame"

It would thus suffice to call aggregate above.

Note also, the error is from trying to coerce the list to a data frame. The first two lines of aggregate.data.frame are as follows:

if (!is.data.frame(x)) 
    x <- as.data.frame(x)

Applying this to df$visits results in:

as.data.frame(df$visits)
## Error in data.frame(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11, check.names = TRUE,  : 
##   arguments imply differing number of rows: 3, 6, 2, 10, 4, 7

Only "rectangular" lists can be coerced to data.frame. All entries must be the same length.

Thanks for the helpful explanation. Everything seems to check out perfectly with my larger data set I'm working with. — user2849910, Dec 15 '13 at 17:20
+1. Interesting finding. It's a little bit annoying that we can't use `aggregate.formula` though, hence [my suggestion](http://stackoverflow.com/a/20595762/1270695) to "melt" the `data.frame` before doing further work with it. +1 for the explanation. — A5C1D2H2I1M1N2O1R2T1, Dec 16 '13 at 06:21
@AnandaMahto yes, that is the advantage to your approach. I do prefer `aggregate.formula` for readability, but it refuses to work with list columns. In fact, I was meaning to +1 your post for that very reason. — Matthew Lundberg, Dec 16 '13 at 06:23

A5C1D2H2I1M1N2O1R2T1 · Answer 2 · 2013-12-16T06:25:42.830

You can also get the output you're looking for if you unlist the list column first and make it so you have a long data.frame to start with:

visits <- unlist(df$visits, use.names=FALSE)
df <- df[rep(rownames(df), sapply(df$visits, length)), c("bee.num", "plant")]
df$visits <- visits
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
#   bee.num plant                                     x
# 1       1     d             1, 2, 3, 4, 5, 6, 7, 8, 9
# 2       2     d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
# 3       3     d                 5, 6, 7, 8, 9, 10, 11
# 4       1     w                                10, 11
# 5       3     w                            1, 2, 3, 4

## Or, better yet:
aggregate(visits ~ bee.num + plant, df, c)

By the way, "data.table" can handle this listing and unlisting pretty directly:

library(data.table)
DT <- data.table(df)
setkey(DT, bee.num, plant)
DT[, list(visits = list(unlist(visits))), by = key(DT)]
#    bee.num plant        visits
# 1:       1     d  1,2,3,4,5,6,
# 2:       1     w         10,11
# 3:       2     d  1,2,3,4,5,6,
# 4:       3     d 5,6,7,8,9,10,
# 5:       3     w       1,2,3,4

The output there only looks truncated. All the information is there:

str(.Last.value)
# Classes ‘data.table’ and 'data.frame':  5 obs. of  3 variables:
#  $ bee.num: num  1 1 2 3 3
#  $ plant  : Factor w/ 2 levels "d","w": 1 2 1 1 2
#  $ visits :List of 5
#   ..$ : int  1 2 3 4 5 6 7 8 9
#   ..$ : int  10 11
#   ..$ : int  1 2 3 4 5 6 7 8 9 10 ...
#   ..$ : int  5 6 7 8 9 10 11
#   ..$ : int  1 2 3 4
#  - attr(*, "sorted")= chr  "bee.num" "plant"
#  - attr(*, ".internal.selfref")=<externalptr>

As an advantage (although not demonstrated by the answer), you can use `aggregate.formula` with this "long" data frame. — Matthew Lundberg, Dec 16 '13 at 06:24

joemienko · Answer 3 · 2013-12-15T00:51:26.787

In answer to your specific question, I don't think aggregate.data.frame will do this easily.

As I've stated in previous posts, most R users would probably come up with a way to do this in plyr.

However, as my first exposure to data analysis was through database scripting, I remain partial to the sqldf package for these sorts of tasks.

I also find SQL to be more transparent to non-R users (something I frequently encounter in the social science community where I do most of my work).

Here is a solution to your problem using sqldf:

#your data assigned to dat
bee.num <- c(1,1,1,2,2,3,3)
plant <- c("d", "d", "w", "d", "d", "w", "d")
visits <- c("1, 2, 3"
            ,"4, 5, 6, 7, 8, 9"
            ,"10, 11"
            ,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
            ,"11, 12"
            ,"1, 2, 3, 4"
            ,"5, 6, 7, 8, 9, 10, 11")


dat <- as.data.frame(cbind(bee_num, plant, visits))

#load sqldf
require(sqldf)

#write a simple SQL aggregate query using group_concat()
#i.e. "select" your fields specifying the aggregate function for the 
#relevant field, "from" a table called dat, and "group by" bee_num
#(because sql_df converts "." into "_" for field names) and plant.
sqldf('select
        bee_num
        ,plant
        ,group_concat(visits) visits
      from dat
      group by 
        bee_num
        ,plant')


  bee_num plant                               visits
1       1     d             1, 2, 3,4, 5, 6, 7, 8, 9
2       1     w                               10, 11
3       2     d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12
4       3     d                5, 6, 7, 8, 9, 10, 11
5       3     w                           1, 2, 3, 4

Fair point. My data were self-created based on an earlier version of the question in which it was not clear whether or not the `visits` column was integer or string. If the `visits` is stored as a list of integer vectors (instead of comma separated values), then sqldf will not work so simply. In any case, it appears that aggregate.data.frame is more flexible than I had thought. — joemienko, Dec 16 '13 at 07:44

Concatenating cells in a data frame using aggregate.data.frame

3 Answers3