Aggregating multiple columns from data frame

Question

I have a data frame that has a bunch of data that's joined with commas in certain elements of the rows. Something that looks like:

df <- data.frame(
c(2012,2012,2012,2013,2013,2013,2014,2014,2014)
,c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i")
)
 names(df) <- c("year", "type")

I want to get it in a form that dcast is close to getting it to, with the year,a,b,c,etc being the columns, and the frequency across the data frame being in the cells of the resultant data frame. I tried first to do colsplit on df and then use dcast after, but that seems to only work if I want to aggregate on one of the levels instead of all.

 df2 <- data.frame( df$year,  colsplit(df$type, ',' , c('v1','v2','v3','v4','v5')) )
 df3 <- dcast(df2, df.year ~ v1)

This result only gives me for the first level of the colsplit, instead of all of them. Am I close to a solution or should I be using a different approach entirely?

akrun · Answer 1 · 2017-08-30T06:39:39.307

3

Here is a single line option with base R by splitting the 'type' column with strsplit, then set the names of the list output as 'year', stack it to a single data.frame and get the frequency count using table

table(stack(setNames(strsplit(as.character(df$type), ","), df$year))[2:1])
#        values
#ind    a b c d e f g h i
#  2012 2 1 3 2 1 1 0 0 0
#  2013 4 1 1 1 0 0 0 0 0
#  2014 1 1 0 0 1 0 2 1 1

edited Aug 30 '17 at 06:39

answered Aug 30 '17 at 06:23

akrun

874,273
37
540
662

score 1 · Answer 2 · answered Aug 30 '17 at 06:22

You are close to the solution. You just need one more step. You need to melt all values in one column before dcast. See the example.

require(reshape2)

df <- data.frame(c(2012,2012,2012,2013,2013,2013,2014,2014,2014),
                 c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i"))
names(df) <- c("year", "type")
df

df2 <- data.frame(df$year, colsplit(df$type, ',', c('v1','v2','v3','v4','v5')))
df2

df3 <- melt(df2, id.vars = "df.year", na.rm = T)
df3

df4 <- dcast(df3[df3$value != "", ], df.year ~ value, fun.aggregate = length)
df4

score 1 · Answer 3 · answered Aug 30 '17 at 06:48

Here's a data.table approach:

library(data.table)
setDT(df)
dcast(df[, .(unlist(strsplit(as.character(type), ",", fixed=TRUE))), by = year], 
 year ~ V1, value.var = "V1", fun.aggregate = length)
#   year a b c d e f g h i
#1: 2012 2 1 3 2 1 1 0 0 0
#2: 2013 4 1 1 1 0 0 0 0 0
#3: 2014 1 1 0 0 1 0 2 1 1

We first split the type column by comma and per year-group to a long-format, then dcast to wide with the length as aggregate function.

score 0 · Answer 4 · answered Aug 30 '17 at 06:12

Maybe, something like this could work?

# extract unique values and years
    vals <- unique(do.call(c, strsplit(x = as.vector(df$type), "[[:punct:]]")))
    years <- unique(df$year)

# count
    df4 <- data.frame(sapply(vals, (function(vl) {sapply(years, (function(ye){ 
      sum(do.call(c, strsplit(as.vector(df$type[df$year == ye]) , "[[:punct:]]")) == vl)
    }))})))
    df4 <- cbind(years, df4)
    df4
#result
  years a b c d e f g h i
1  2012 2 1 3 2 1 1 0 0 0
2  2013 4 1 1 1 0 0 0 0 0
3  2014 1 1 0 0 1 0 2 1 1

Aggregating multiple columns from data frame

4 Answers4

Related