1

sorry if this has an obvious answer. i'm trying to perform a reshape that has lots of stackoverflow answers when only one column gets used or when the column names can be hardcoded, but i need an answer that will work dynamically when the ordered.cols and unique.cols vectors are not set from the beginning

# these two sets of columns need to be dynamic
# they might be any two sets of columns!
ordered.cols <- c( 'cyl' , 'gear' )
unique.cols <- c( 'am' , 'vs' )
# neither of the above two character vectors will be known beforehand


# so here's the example starting data set
x <- mtcars[ , c( ordered.cols , unique.cols ) ]

# the desired output should have this many records:
unique( x[ , ordered.cols ] )
# but i'm unsure of the smartest way to add the additional columns that i want--


# for *each* unique level in *each* of the variables in
# `unique.cols` there should be one additional column added
# to the final output.  then, for that `ordered.cols` combination
# the cell should be populated with the value if it exists
# and NA otherwise
desired.output <-
    structure(list(cyl = c(4L, 4L, 4L, 6L, 6L, 6L, 8L, 8L), gear = c(3L, 
    4L, 5L, 3L, 4L, 5L, 3L, 5L), am1 = c(0L, 0L, 1L, 0L, 0L, 1L, 
    0L, 1L), am2 = c(NA, 1L, NA, NA, 1L, NA, NA, NA), vs1 = c(1L, 
    1L, 0L, 1L, 0L, 0L, 0L, 0L), vs2 = c(NA, NA, 1L, NA, 1L, NA, 
    NA, NA)), .Names = c("cyl", "gear", "am1", "am2", "vs1", "vs2"
    ), class = "data.frame", row.names = c(NA, -8L))


desired.output

i don't really care if the new columns are named am1, am2, vs1, vs2, or something more convenient. but if there are two distinct values of am in the data, there need to be two data-holding columns in the final output, one of which should be missing if that combination doesn't have the value.

# second example #
ordered.cols <- c( 'a' , 'b' )
unique.cols <- 'd'

# starting data set
y <-
    data.frame( 
        a = c( 1 , 1 , 1 , 2 ) , 
        b = c( 1 , 1 , 2 , 2 ) , 
        d = c( 'z' , 'y' , 'x' , 'x' ) 
    )

# the desired output here should have this many rows..
unique( y[ , ordered.cols ] )
# now the contents of all columns in `unique.cols`
# (which in this case is only column `d`)
# need to be appended as a widened data set

second.desired.output <-
    data.frame( 
        a = c( 1 , 1 , 2 ) ,
        b = c( 1 , 2 , 2 ) ,
        d1 = c( 'z' , 'x' , 'x' ) ,
        d2 = c( 'y' , NA , NA )
    )

second.desired.output

thanks!!!!!!

Anthony Damico
  • 5,779
  • 7
  • 46
  • 77
  • I would have called this reshaping to "wide", and I'm a bit unclear which columns are supposed to be known by name (or not) at the beginning.b Also there were 6 lines of data in 'x' that has cyl and gear = 4 with am= 1, so how does that info get compacted? – IRTFM Aug 01 '14 at 21:51
  • @BondedDust agree with your title suggestion :) the columns will be in a character vector supplied, but they'll have to be extracted somehow so something [like this](http://stackoverflow.com/questions/14800161/how-to-find-the-top-n-values-by-group-or-within-category-groupwise-in-an-r-dat/14800163#14800163) will not work because that x$mpg is hardcoded.. – Anthony Damico Aug 01 '14 at 21:55
  • Well that's easy. Use "[[" instead of "$". It will accept a character argument. – IRTFM Aug 01 '14 at 21:56
  • @BondedDust ..but not multiple columns at once? and even if i can rig tapply to provide the unique levels of each column within multiple-column groups, adding the wide columns is a pretty cumbersome process? i have gotta be missing something obvious here, but i can't figure out what. – Anthony Damico Aug 01 '14 at 22:00
  • If I knew what was to be done with one column I'm fairly sure I could answer the question. Still can't figure out the rule to be applied to compress 6 lines of data to a value of `1`. Is it just "existence" of any? – IRTFM Aug 01 '14 at 22:01
  • @BondedDust yes, i'm sorry for being unclear..it's just the existence of any. i've added a second example to hopefully clarify. thanks so much for offering your time to help me!! :) – Anthony Damico Aug 01 '14 at 22:13
  • `aggregate` could be one (ugly) way to go. For your `mtcars` example you could use `aggregate(cbind(am, vs) ~ cyl + gear , data=mtcars, function(i) unique(identity(i)))` and your minimal example you could use `aggregate(d~ a + b, data=y, function(i) unique(identity(i)))`. All you need to do then is `strsplit` the final column. [your column `d` in the first example should be character] – user20650 Aug 01 '14 at 22:55
  • @user20650 thanks but you are hard-coding the column names? -- i need this run on different data.frame objects where a, b, and d will be different at every iteration of the loop. – Anthony Damico Aug 01 '14 at 22:58
  • Maybe you can do something along the lines of this: `aggregate(dat[, unique.cols, drop=F] , by=dat[, ordered.cols] , function(i) unique(identity(i)))`. Of course still needs tweaking. – user20650 Aug 01 '14 at 23:17
  • @user20650 wow awesome! this works quite well if the leftover columns are numeric! if they are character, it should be this: `aggregate(x[, unique.cols, drop=F] , by=x[, ordered.cols] , function(i) as.character(unique(i)))` could you add this as an answer so i can +1? thanks! – Anthony Damico Aug 01 '14 at 23:32
  • @Anthony Damico. You could use `cSplit ` along with the `aggregate` showed by user20650. For the first case: `cSplit(aggregate(x[, unique.cols, drop=F] , by=x[, ordered.cols] , function(i) paste(sort(unique(i)), collapse=",")), c("am", "vs"), "wide", sep=",")[order(cyl)]` and for second `cSplit(aggregate(y[, unique.cols, drop=F] , by=y[, ordered.cols] , function(i) paste(sort(unique(i)), collapse=",")), "d", "wide", sep=",")[order(a)]`. Here I used `sort`. – akrun Aug 02 '14 at 06:49
  • @akrun this is fantastic! could you make this an answer? is cSplit part of some package? i searched around and only found it here: https://gist.github.com/mrdwab/11380733 – Anthony Damico Aug 02 '14 at 07:00
  • @Anthony Damico, I will modify my deleted answer – akrun Aug 02 '14 at 07:04

2 Answers2

2
library(data.table)

reshapeMyData <- function(x, unique.cols, ordered.cols, NA_class="integer") {
  DT <- as.data.table(x)

  unique.values <- lapply(DT[, unique.cols, with=FALSE], unique)

  ## If your NA is of the wrong class, it can potentially throw an error, 
  ##    depending on when it first shows up.  It is better to be explicit about the expected class
  NA.classed <- as(NA, NA_class)

  ###  -- This is all one line.. it iterates over the unique combinations of ordered.cols values
  DT[, {browser(expr=FALSE)
    ## These three functions shape the data as needed
    setDT(as.list(unlist(
      ## This mapply call checks if each value is in the given group
      mapply(function(v, C) {ifelse(v %in% C, v, NA.classed)}, v=unique.values, C=.SD, SIMPLIFY=FALSE)
    )))
  }
  , keyby=ordered.cols, .SDcols=unique.cols]

} ## // end function reshapeMyData

OUTPUT

reshapeMyData(x, unique.cols, ordered.cols)

   cyl gear am1 am2 vs1 vs2
1:   4    3  NA   0  NA   1
2:   4    4   1   0  NA   1
3:   4    5   1  NA   0   1
4:   6    3  NA   0  NA   1
5:   6    4   1   0   0   1
6:   6    5   1  NA   0  NA
7:   8    3  NA   0   0  NA
8:   8    5   1  NA   0  NA

reshapeMyData(y, "d", c("a", "b"), NA_class="character")

   a b d1 d2 d3
1: 1 1  z  y NA
2: 1 2 NA NA  x
3: 2 2 NA NA  x
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • thank you! this is pretty genius. is there an easy way to condense/left align the columns? (in my second example, the two `x`s are in d1 and the column d3 doesn't exist) – Anthony Damico Aug 02 '14 at 06:18
0

for my purposes, this solution seems to work well:

aggregate( x[ , unique.cols ] , by = x[ , ordered.cols ] , function( w ) paste( sort( unique( w ) ) , collapse = "," ) )

aggregate( y[ , unique.cols ] , by = y[ , ordered.cols ] , function( w ) paste( sort( unique( w ) ) , collapse = "," ) )

sometimes (i am not sure why, but i think it's a factor-coercion issue) nrow( unique( x[ , ordered.cols ] ) ) does not equal the nrow of the output of the above commands. in those cases, this workaround seems to do the trick:

halfway <- aggregate( x[ , unique.cols ] , by = list( apply( x[ , ordered.cols ] , 1 , paste , collapse = "" ) ) , function( w ) paste( sort( unique( w ) ) , collapse = "," )  )

cbind( unique( x[ , ordered.cols ] ) , halfway[ , -1 ] )
Anthony Damico
  • 5,779
  • 7
  • 46
  • 77