3

I have a data frame that has a bunch of data that's joined with commas in certain elements of the rows. Something that looks like:

df <- data.frame(
c(2012,2012,2012,2013,2013,2013,2014,2014,2014)
,c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i")
)
 names(df) <- c("year", "type")

I want to get it in a form that dcast is close to getting it to, with the year,a,b,c,etc being the columns, and the frequency across the data frame being in the cells of the resultant data frame. I tried first to do colsplit on df and then use dcast after, but that seems to only work if I want to aggregate on one of the levels instead of all.

 df2 <- data.frame( df$year,  colsplit(df$type, ',' , c('v1','v2','v3','v4','v5')) )
 df3 <- dcast(df2, df.year ~ v1)

This result only gives me for the first level of the colsplit, instead of all of them. Am I close to a solution or should I be using a different approach entirely?

AI52487963
  • 1,253
  • 2
  • 17
  • 36

4 Answers4

3

Here is a single line option with base R by splitting the 'type' column with strsplit, then set the names of the list output as 'year', stack it to a single data.frame and get the frequency count using table

table(stack(setNames(strsplit(as.character(df$type), ","), df$year))[2:1])
#        values
#ind    a b c d e f g h i
#  2012 2 1 3 2 1 1 0 0 0
#  2013 4 1 1 1 0 0 0 0 0
#  2014 1 1 0 0 1 0 2 1 1
akrun
  • 874,273
  • 37
  • 540
  • 662
1

You are close to the solution. You just need one more step. You need to melt all values in one column before dcast. See the example.

require(reshape2)

df <- data.frame(c(2012,2012,2012,2013,2013,2013,2014,2014,2014),
                 c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i"))
names(df) <- c("year", "type")
df

df2 <- data.frame(df$year, colsplit(df$type, ',', c('v1','v2','v3','v4','v5')))
df2

df3 <- melt(df2, id.vars = "df.year", na.rm = T)
df3

df4 <- dcast(df3[df3$value != "", ], df.year ~ value, fun.aggregate = length)
df4
djhurio
  • 5,437
  • 4
  • 27
  • 48
1

Here's a data.table approach:

library(data.table)
setDT(df)
dcast(df[, .(unlist(strsplit(as.character(type), ",", fixed=TRUE))), by = year], 
 year ~ V1, value.var = "V1", fun.aggregate = length)
#   year a b c d e f g h i
#1: 2012 2 1 3 2 1 1 0 0 0
#2: 2013 4 1 1 1 0 0 0 0 0
#3: 2014 1 1 0 0 1 0 2 1 1

We first split the type column by comma and per year-group to a long-format, then dcast to wide with the length as aggregate function.

talat
  • 68,970
  • 21
  • 126
  • 157
0

Maybe, something like this could work?

# extract unique values and years
    vals <- unique(do.call(c, strsplit(x = as.vector(df$type), "[[:punct:]]")))
    years <- unique(df$year)

# count
    df4 <- data.frame(sapply(vals, (function(vl) {sapply(years, (function(ye){ 
      sum(do.call(c, strsplit(as.vector(df$type[df$year == ye]) , "[[:punct:]]")) == vl)
    }))})))
    df4 <- cbind(years, df4)
    df4
#result
  years a b c d e f g h i
1  2012 2 1 3 2 1 1 0 0 0
2  2013 4 1 1 1 0 0 0 0 0
3  2014 1 1 0 0 1 0 2 1 1
Damiano Fantini
  • 1,925
  • 9
  • 11