4

I have a data frame like the following:

df <- data.frame(bee.num=c(1,1,1,2,2,3,3), plant=c("d","d","w","d","d","w","d")) 
df$visits = list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4,5:11)
df

    bee.num plant                        visits
1       1     d                          1, 2, 3
2       1     d                 4, 5, 6, 7, 8, 9
3       1     w                           10, 11
4       2     d    1, 2, 3, 4, 5, 6, 7, 8, 9, 10
5       2     d                           11, 12
6       3     w                       1, 2, 3, 4
7       3     d            5, 6, 7, 8, 9, 10, 11

I would like to aggregate visits by bee.num and plant with a function that concatenates the values for visit based on matching bee.num and plant values, like the one below

     bee.num plant                        visits
1       1     d                1, 2, 3, 4, 5, 6, 7, 8, 9
2       1     w                                   10, 11
3       2     d    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
4       3     w                               1, 2, 3, 4
5       3     d                    5, 6, 7, 8, 9, 10, 11

I've tried

aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)

and

aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=unlist)

but I always get an "arguments imply differing number of rows" error. Any help would be greatly appreciated. Thanks in advance.

user2849910
  • 191
  • 1
  • 2
  • 6
  • 1
    I suggest you provide us with a reproducible example. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Luštrik Dec 15 '13 at 00:02
  • Thanks, just added some code for reproducing the first data frame - hope that addresses your suggestion. – user2849910 Dec 15 '13 at 00:22

3 Answers3

3

The function works as expected if you pass a data frame containing the list as a column, rather than pass the list itself.

x <- aggregate.data.frame(df['visits'], list(df$bee.num, df$plant) , FUN=c)
names(x) <- c('bee.num', 'plant', 'visits')
x
##   bee.num plant                                visits
## 1       1     d             1, 2, 3, 4, 5, 6, 7, 8, 9
## 2       2     d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
## 3       3     d                 5, 6, 7, 8, 9, 10, 11
## 4       1     w                                10, 11
## 5       3     w                            1, 2, 3, 4

Note:

> class(df$visits)
[1] "list"
> class(df['visits'])
[1] "data.frame"

It would thus suffice to call aggregate above.

Note also, the error is from trying to coerce the list to a data frame. The first two lines of aggregate.data.frame are as follows:

if (!is.data.frame(x)) 
    x <- as.data.frame(x)

Applying this to df$visits results in:

as.data.frame(df$visits)
## Error in data.frame(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11, check.names = TRUE,  : 
##   arguments imply differing number of rows: 3, 6, 2, 10, 4, 7

Only "rectangular" lists can be coerced to data.frame. All entries must be the same length.

Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • Thanks for the helpful explanation. Everything seems to check out perfectly with my larger data set I'm working with. – user2849910 Dec 15 '13 at 17:20
  • +1. Interesting finding. It's a little bit annoying that we can't use `aggregate.formula` though, hence [my suggestion](http://stackoverflow.com/a/20595762/1270695) to "melt" the `data.frame` before doing further work with it. +1 for the explanation. – A5C1D2H2I1M1N2O1R2T1 Dec 16 '13 at 06:21
  • @AnandaMahto yes, that is the advantage to your approach. I do prefer `aggregate.formula` for readability, but it refuses to work with list columns. In fact, I was meaning to +1 your post for that very reason. – Matthew Lundberg Dec 16 '13 at 06:23
1

You can also get the output you're looking for if you unlist the list column first and make it so you have a long data.frame to start with:

visits <- unlist(df$visits, use.names=FALSE)
df <- df[rep(rownames(df), sapply(df$visits, length)), c("bee.num", "plant")]
df$visits <- visits
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
#   bee.num plant                                     x
# 1       1     d             1, 2, 3, 4, 5, 6, 7, 8, 9
# 2       2     d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
# 3       3     d                 5, 6, 7, 8, 9, 10, 11
# 4       1     w                                10, 11
# 5       3     w                            1, 2, 3, 4

## Or, better yet:
aggregate(visits ~ bee.num + plant, df, c)

By the way, "data.table" can handle this listing and unlisting pretty directly:

library(data.table)
DT <- data.table(df)
setkey(DT, bee.num, plant)
DT[, list(visits = list(unlist(visits))), by = key(DT)]
#    bee.num plant        visits
# 1:       1     d  1,2,3,4,5,6,
# 2:       1     w         10,11
# 3:       2     d  1,2,3,4,5,6,
# 4:       3     d 5,6,7,8,9,10,
# 5:       3     w       1,2,3,4

The output there only looks truncated. All the information is there:

str(.Last.value)
# Classes ‘data.table’ and 'data.frame':  5 obs. of  3 variables:
#  $ bee.num: num  1 1 2 3 3
#  $ plant  : Factor w/ 2 levels "d","w": 1 2 1 1 2
#  $ visits :List of 5
#   ..$ : int  1 2 3 4 5 6 7 8 9
#   ..$ : int  10 11
#   ..$ : int  1 2 3 4 5 6 7 8 9 10 ...
#   ..$ : int  5 6 7 8 9 10 11
#   ..$ : int  1 2 3 4
#  - attr(*, "sorted")= chr  "bee.num" "plant"
#  - attr(*, ".internal.selfref")=<externalptr> 
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
0

In answer to your specific question, I don't think aggregate.data.frame will do this easily.

As I've stated in previous posts, most R users would probably come up with a way to do this in plyr.

However, as my first exposure to data analysis was through database scripting, I remain partial to the sqldf package for these sorts of tasks.

I also find SQL to be more transparent to non-R users (something I frequently encounter in the social science community where I do most of my work).

Here is a solution to your problem using sqldf:

#your data assigned to dat
bee.num <- c(1,1,1,2,2,3,3)
plant <- c("d", "d", "w", "d", "d", "w", "d")
visits <- c("1, 2, 3"
            ,"4, 5, 6, 7, 8, 9"
            ,"10, 11"
            ,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
            ,"11, 12"
            ,"1, 2, 3, 4"
            ,"5, 6, 7, 8, 9, 10, 11")


dat <- as.data.frame(cbind(bee_num, plant, visits))

#load sqldf
require(sqldf)

#write a simple SQL aggregate query using group_concat()
#i.e. "select" your fields specifying the aggregate function for the 
#relevant field, "from" a table called dat, and "group by" bee_num
#(because sql_df converts "." into "_" for field names) and plant.
sqldf('select
        bee_num
        ,plant
        ,group_concat(visits) visits
      from dat
      group by 
        bee_num
        ,plant')


  bee_num plant                               visits
1       1     d             1, 2, 3,4, 5, 6, 7, 8, 9
2       1     w                               10, 11
3       2     d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12
4       3     d                5, 6, 7, 8, 9, 10, 11
5       3     w                           1, 2, 3, 4
joemienko
  • 2,220
  • 18
  • 27
  • Your input data doesn't seem to match the OP's input data. – A5C1D2H2I1M1N2O1R2T1 Dec 16 '13 at 06:15
  • Fair point. My data were self-created based on an earlier version of the question in which it was not clear whether or not the `visits` column was integer or string. If the `visits` is stored as a list of integer vectors (instead of comma separated values), then sqldf will not work so simply. In any case, it appears that aggregate.data.frame is more flexible than I had thought. – joemienko Dec 16 '13 at 07:44