
I'd like to create a new data frame in R that takes a set of rows and, for each row, lists every left-to-right prefix of its values, padding the remaining columns with NA (giving nrow * ncol = 6 output rows here).

library(dplyr)
dat <- read.table(text =
        " Animal Color Size
          Cat Orange 10
          Dog Black 20", header=TRUE)

Would like this output:

Animal  Color   Size 
Cat     NA      NA
Cat     Orange  NA
Cat     Orange  10
Dog     NA      NA
Dog     Black   NA
Dog     Black   20

Is there a function in R which can do this -- something like expand.grid?

expand.grid(dat$Animal, dat$Color, dat$Size) %>% arrange(Var1, Var2, Var3) #Note: this does not give the correct answer.
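For reference on why `expand.grid` isn't the right tool: it builds the full Cartesian product of its inputs, so with two values per column it returns 2 * 2 * 2 = 8 rows rather than the 6 prefix rows shown above. A quick check:

```r
dat <- read.table(text =
        " Animal Color Size
          Cat Orange 10
          Dog Black 20", header = TRUE)

# expand.grid crosses every value of every column: 2 * 2 * 2 rows
nrow(expand.grid(dat$Animal, dat$Color, dat$Size))
# [1] 8
```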

I'm able to create the first chunk of dat's first row using:

row1 <- c("Cat", "Orange", 10) # renamed so it doesn't clobber the dat data frame

datInner <- list()
for(i in seq_along(row1)){ # one pass per prefix length, 3x here
   datInner[[i]] <- row1[1:i]
}

library(plyr)
# Adapted from http://stackoverflow.com/questions/17308551/do-callrbind-list-for-uneven-number-of-column
plyr::rbind.fill(lapply(datInner, function(y){as.data.frame(t(y),
            stringsAsFactors = FALSE)}))

    # V1     V2   V3
    # 1 Cat   <NA> <NA>
    # 2 Cat Orange <NA>
    # 3 Cat Orange   10
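Generalising that snippet to every row of `dat` can be sketched in base R alone (no plyr needed); `stem` is a hypothetical helper name, not an existing function:

```r
# Sketch: build the NA-padded prefix rows for every row of a data frame.
stem <- function(df) {
  n <- ncol(df)
  rows <- lapply(seq_len(nrow(df)), function(k) {
    # coerce the row to character so mixed column types can share a matrix
    r <- unname(vapply(df[k, ], as.character, character(1)))
    # NA-pad every prefix of the row out to the full column count
    do.call(rbind, lapply(seq_len(n), function(i)
      c(r[seq_len(i)], rep(NA_character_, n - i))))
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- names(df)
  out
}

dat <- read.table(text =
        " Animal Color Size
          Cat Orange 10
          Dog Black 20", header = TRUE)
stem(dat)
#   Animal  Color Size
# 1    Cat   <NA> <NA>
# 2    Cat Orange <NA>
# 3    Cat Orange   10
# 4    Dog   <NA> <NA>
# 5    Dog  Black <NA>
# 6    Dog  Black   20
```

Note that all columns come back as character, since the values pass through a character matrix.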

NOTE: I'll call this function type a Sequential Tree Extended Matrix (STEM). It takes a table describing a tree whose node depths may vary, listing end nodes only, and converts it into a table with all sequential combinations of the tree.

eyeOfTheStorm

3 Answers


A dplyr solution, though not very general:

library(dplyr)
rbind(
  dat  %>%  
    group_by(Animal) %>%
    summarize(Color = NA, Size = NA) %>%
    ungroup(),
  dat %>%
    group_by(Animal, Color) %>%
    summarize(Size = NA) %>%
    ungroup(),
  dat) %>% arrange(Animal)

#  Animal  Color  Size
#1    Cat   <NA>    NA
#2    Cat Orange    NA
#3    Cat Orange    10
#4    Dog   <NA>    NA
#5    Dog  Black    NA
#6    Dog  Black    20
Andrew Lavers
  • In the spirit of this question, given that your answer is with dplyr, I will give you the check mark for making this adaptable to a varying column environment. As it stands, it looks like the user would need to include a bunch of extra `group_by` statements with specific names of each column. – eyeOfTheStorm Jan 20 '17 at 22:43
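One way to make this column-agnostic without hand-written `group_by` calls per column (a sketch, here in base R for brevity): at each depth, blank out the trailing columns, de-duplicate, stack, and sort.

```r
dat <- read.table(text =
        " Animal Color Size
          Cat Orange 10
          Dog Black 20", header = TRUE)

# One truncated copy of dat per depth: columns past depth i become NA
pieces <- lapply(seq_len(ncol(dat)), function(i) {
  d <- as.data.frame(lapply(dat, as.character), stringsAsFactors = FALSE)
  if (i < ncol(d)) d[(i + 1):ncol(d)] <- NA
  unique(d)  # collapse duplicates in case node depths vary
})
res <- do.call(rbind, pieces)
res <- res[order(res$Animal), ]  # interleave the per-depth blocks by Animal
res
```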

There are probably more efficient answers than this; this is just an attempt!

#1. build an index matrix: row i keeps the first i columns and NA-pads the rest
m <- t(sapply(1:ncol(dat), function(i) c(1:i, rep(NA, ncol(dat) - i))))
m
#     [,1] [,2] [,3]
#[1,]    1   NA   NA
#[2,]    1    2   NA
#[3,]    1    2    3

#2. now apply each index row of m to every row of the original data (dat),
#   i.e. subset each row with NA padding
m1 <- apply(dat, 1, function(i) apply(m, 1, function(j) i[j]))
data.frame(matrix(m1, byrow = TRUE, ncol = ncol(dat)))
#   X1     X2   X3
#1 Cat   <NA> <NA>
#2 Cat Orange <NA>
#3 Cat Orange   10
#4 Dog   <NA> <NA>
#5 Dog  Black <NA>
#6 Dog  Black   20

Note: the Size column is no longer numeric, because the values pass through a character matrix.
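If the numeric column should come back as numbers, one option (a sketch on a small hypothetical frame with the same shape as the result above) is to re-type the columns afterwards with `utils::type.convert`:

```r
# Hypothetical example frame matching the shape of the result above
res <- data.frame(X1 = c("Cat", "Cat", "Cat"),
                  X2 = c(NA, "Orange", "Orange"),
                  X3 = c(NA, NA, "10"),
                  stringsAsFactors = FALSE)

# "10" becomes numeric; purely textual columns stay character
res[] <- lapply(res, type.convert, as.is = TRUE)
sapply(res, class)
```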

joel.wilson

It turns out a for loop filling a list solves this much more easily than I initially thought, and it generalizes to any number of rows with varying node depths. It runs at the same speed as joel's excellent answer on the two-row example, and the loop can be parallelized (e.g. foreach with %dopar%) for much quicker runs without resorting to Matrix. Note: both joel's answer and this one need a unique/distinct step if node depths vary -- e.g. an NA in place of the 20 in the dat table.

library(dplyr)
datInner <- list()
for(i in 1:ncol(dat)){ datInner[[i]] <- dat[1:i] } # foreach %dopar% for parallel
datInner %>% bind_rows() %>% arrange(Animal) # sort to interleave the per-depth blocks
eyeOfTheStorm