
I can't find an answer to this specific case. I looked at "How to split a character vector into data frame?" and a few other places.

I am trying to split a character vector in R into a data frame, with a set number of columns, filling in NA for any extras or missing. As below (reproducible):

###Reproduce column vector
cv <- c("a1", "b1", "c1", "d1", "e1", "f1", "aa2", "bb2", "cc2", "dd2", "ee2", "ff2", "x1", "x2", "x3", "x4", "x5", "x6", "rr2", "tt3", "bb4")

###Desired data frame separating 6 columns
df.desired <- data.frame(col1 = c("a1", "aa2", "x1", "rr2"),
                         col2 = c("b1", "bb2", "x2", "tt3"),
                         col3 = c("c1", "cc2", "x3", "bb4"),
                         col4 = c("d1", "dd2", "x4", NA),
                         col5 = c("e1", "ee2", "x5", NA),
                         col6 = c("f1", "ff2", "x6", NA),
                         stringsAsFactors = FALSE)

Thanks in advance!

Neal Barsch
    Can you please be more explicit about the rule for allocating `NA` to different columns. Cheers – Henrik Jul 05 '18 at 21:28

2 Answers


1) base Create a matrix of NA values of the requisite dimensions and then fill it with cv up to its length. Transpose that and convert to a data frame.

mat <- t(replace(matrix(NA, 6, ceiling(length(cv) / 6)), seq_along(cv), cv))
as.data.frame(mat, stringsAsFactors = FALSE)
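To see why the transpose is needed, the one-liner can be unpacked into its intermediate steps. This is only an illustrative sketch: the names n_col, n_row, na_mat and filled are introduced here and are not part of the answer.

```r
cv <- c("a1", "b1", "c1", "d1", "e1", "f1", "aa2", "bb2", "cc2", "dd2", "ee2",
        "ff2", "x1", "x2", "x3", "x4", "x5", "x6", "rr2", "tt3", "bb4")

n_col <- 6
n_row <- ceiling(length(cv) / n_col)  # 21 values need 4 rows

# Start with a 6 x 4 matrix: R fills matrices column by column,
# so each column here will become one row of the final result.
na_mat <- matrix(NA, n_col, n_row)

# replace() assigns cv to the first length(cv) cells; the rest stay NA.
filled <- replace(na_mat, seq_along(cv), cv)

# Transposing gives the desired 4 x 6 layout.
mat <- t(filled)
mat[4, ]  # last row: three values followed by NA padding
```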

2) another base solution Copy cv to cv2, extend cv2 to the required length (which pads it with NA), and reshape it into a matrix. The copy only serves to preserve the original cv; if you don't mind appending NAs to cv itself, use cv directly and drop the first line (and the last line as well if a matrix mat is enough and no data frame is needed). The byrow argument of matrix fills the matrix row by row, so no transpose is needed.

cv2 <- cv
length(cv2) <- 6 * ceiling(length(cv) / 6)
mat <- matrix(cv2, ncol = 6, byrow = TRUE)
as.data.frame(mat, stringsAsFactors = FALSE)
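If this is needed more than once, the same idea can be wrapped in a small function. The sketch below is one possible way to do it; vec_to_df and its arguments n_col and prefix are hypothetical names chosen for illustration, and the col1..col6 naming simply mirrors df.desired from the question.

```r
cv <- c("a1", "b1", "c1", "d1", "e1", "f1", "aa2", "bb2", "cc2", "dd2", "ee2",
        "ff2", "x1", "x2", "x3", "x4", "x5", "x6", "rr2", "tt3", "bb4")

# Hypothetical helper: reshape x into a data frame with n_col columns,
# padding the last row with NA. x is a local copy, so the caller's
# vector is left unchanged.
vec_to_df <- function(x, n_col, prefix = "col") {
  length(x) <- n_col * ceiling(length(x) / n_col)  # pad with NA
  mat <- matrix(x, ncol = n_col, byrow = TRUE)     # fill row by row
  colnames(mat) <- paste0(prefix, seq_len(n_col))
  as.data.frame(mat, stringsAsFactors = FALSE)
}

res <- vec_to_df(cv, 6)
res
```

Because the padding happens inside the function, the separate cv2 copy is no longer needed at the call site.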

3) base solution using ts This one takes the row and column indexes from the times of a ts object instead of computing the dimensions arithmetically. First create tt, the times of a ts object built from cv with frequency 6. tt is itself a ts object: as.integer(tt) gives the row index of each element and cycle(tt) gives its column index. Finally, use tapply with those two indexes:

tt <- time(ts(cv, frequency = 6))
mat <- tapply(cv, list(as.integer(tt), cycle(tt)), c)
as.data.frame(mat, stringsAsFactors = FALSE)
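For comparison, the same tapply call also works when the two indexes are computed with plain integer arithmetic. Note that this swaps the ts-based indexes for integer division and modulo, so it is a variant of the approach rather than the answer's own code; row_idx and col_idx are names made up for this sketch.

```r
cv <- c("a1", "b1", "c1", "d1", "e1", "f1", "aa2", "bb2", "cc2", "dd2", "ee2",
        "ff2", "x1", "x2", "x3", "x4", "x5", "x6", "rr2", "tt3", "bb4")

i <- seq_along(cv) - 1   # zero-based position of each element
row_idx <- i %/% 6 + 1   # 1,1,1,1,1,1,2,... : which row the element goes to
col_idx <- i %% 6 + 1    # 1,2,3,4,5,6,1,... : which column it goes to

# Each (row, column) cell holds at most one element, so tapply returns
# a character matrix and leaves the unused cells as NA.
mat <- tapply(cv, list(row_idx, col_idx), c)
mat
```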

4) rollapply Like (3), this one does not explicitly calculate the dimensions of mat. Instead it uses rollapplyr from the zoo package with a simple function, Fill, which returns its argument x padded with NAs on the right to a length of 6.

library(zoo)

Fill <- function(x) { length(x) <- 6; x }
mat <- rollapplyr(cv, 6, by = 6, Fill, align = "left", partial = TRUE)
as.data.frame(mat, stringsAsFactors = FALSE)

In all of the alternatives above, omit the last line if the matrix mat is adequate as the result.

Added

As of R 4.0.0, stringsAsFactors = FALSE is the default, so it can be omitted above.

G. Grothendieck

1) base R - split the vector into groups of 6 using a grouping variable created with gl, then pad each group with NA to a common length with `length<-`

lst <- split(cv, as.integer(gl(length(cv), 6, length(cv))))
as.data.frame(do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
#  V1  V2  V3   V4   V5   V6
#1  a1  b1  c1   d1   e1   f1
#2 aa2 bb2 cc2  dd2  ee2  ff2
#3  x1  x2  x3   x4   x5   x6
#4 rr2 tt3 bb4 <NA> <NA> <NA>
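To make the grouping step more concrete, the sketch below prints the group numbers that gl produces and compares them with an arithmetic alternative. The names grp_gl and grp_alt are introduced here for illustration only; the ceiling-based version is an assumption about an equivalent formulation, not part of the answer.

```r
cv <- c("a1", "b1", "c1", "d1", "e1", "f1", "aa2", "bb2", "cc2", "dd2", "ee2",
        "ff2", "x1", "x2", "x3", "x4", "x5", "x6", "rr2", "tt3", "bb4")

# gl() repeats each level 6 times and is truncated to length(cv),
# giving one group number per element of cv.
grp_gl <- as.integer(gl(length(cv), 6, length(cv)))
grp_gl

# The same group numbers from plain arithmetic.
grp_alt <- as.integer(ceiling(seq_along(cv) / 6))
identical(grp_gl, grp_alt)

# Group sizes after splitting; the last group is the short one that
# `length<-` pads with NA.
lengths(split(cv, grp_gl))
```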
akrun