3

I have a function that takes as input a dataframe with certain columns

columns =['a', 'b',...,'z']

Now I have a dataframe DF with only few of these columns DF_columns = ['f', 'u', 'z']

How can I create a dataframe that has all the columns with value NA if the columns are not in DF and that coincides with DF on the columns ['f', 'u', 'z']

Example:

d = data.frame('g'=c(1,2,3), 's' = c(4,2,3))
columns = letters[1:21]
columns
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
[21] "u"

> d
  g s
1 1 4
2 2 2
3 3 3
> 
Artjom B.
  • 61,146
  • 24
  • 125
  • 222
Donbeo
  • 17,067
  • 37
  • 114
  • 188

6 Answers6

3
x.or.na <- function(x, df) if (x %in% names(df)) df[[x]] else NA
as.data.frame(Map(x.or.na, columns, list(d)))
flodel
  • 87,577
  • 21
  • 185
  • 223
2
set.seed(42)
 DF <- setNames(as.data.frame(matrix(sample(1:15, 15, replace=TRUE), ncol=3)), c('f', 'u', 'z') )

  DF
  #  f  u  z
  #1 14  8  7
  #2 15 12 11
  #3  5  3 15
  #4 13 10  4
  #5 10 11  7

 res <- do.call(`data.frame`,lapply(split(letters[4:26], letters[4:26]), 
       function(x){x1 <- match(x, colnames(DF)); if(!is.na(x1)) DF[,x1] else NA}))

 res    
 #  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
 #1 NA NA 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA  8 NA NA NA NA  7
 #2 NA NA 15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 12 NA NA NA NA 11
 #3 NA NA  5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA  3 NA NA NA NA 15
 #4 NA NA 13 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 10 NA NA NA NA  4
 #5 NA NA 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 11 NA NA NA NA  7

Using dplyr

 library(dplyr)
   DF %>% 
   do({x1 <-data.frame(., setNames(as.list(rep(NA, sum(!letters[4:26] %in% names(DF)))), 
  setdiff(letters[4:26], names(DF))))
    x1[,order(colnames(x1))] })    
  #  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
 #1 NA NA 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA  8 NA NA NA NA  7
 #2 NA NA 15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 12 NA NA NA NA 11
 #3 NA NA  5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA  3 NA NA NA NA 15
 #4 NA NA 13 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 10 NA NA NA NA  4
 #5 NA NA 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 11 NA NA NA NA  7
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Inside your `ifelse`, `length(letters) %in% colnames(DF)` is not equal to `length(DF)` so it is just luck if you get the expected result after recycling. See about replacing `letters` with `letters[4:26]` for example. `cbind` also forces all data to the same type, not a good thing IMHO. – flodel Aug 03 '14 at 16:35
  • @flodel, Oops, didn't notice that. Thank you very much for pointing that out. – akrun Aug 03 '14 at 16:42
  • still something wrong (sorry...), try adding `DF$k <- TRUE`. – flodel Aug 03 '14 at 16:54
  • your solution makes an assumption about the order of the columns in `DF`; I suggested to add DF$k after it was created so `k` is the last column, but you can try `DF <- rev(DF)` to test. Also the column of TRUE is converted to `1` so moving away from `cbind` did not help. – flodel Aug 03 '14 at 17:00
2

Here are a few methods and their timings.

createDF1 <- function(colVec, data)
{
    m <- matrix(, nrow = nrow(data), ncol = length(colVec), 
                dimnames = list(NULL, colVec))
    m[, names(data)] <- as.matrix(data)
    data.frame(apply(m, 2, as.numeric))
} 

createDF2 <- function(colVec, data)
{
    rr <- setNames(rep(list(rep(NA_integer_, nrow(data))), length(colVec)),  .
                   nm = colVec)
    rr[match(names(data), colVec)] <- data
    as.data.frame(rr)
}

createDF3 <- function(colVec, data)
{
    rr <- setNames(replicate(length(colVec),  
                             list(rep(NA_integer_, nrow(data)))),  
                   nm = colVec)
    rr[match(names(d), colVec)] <- data
    as.data.frame(rr)
}

Create a 3,000,000 x 3 data frame to test on:

columns <- letters[1:21]
d <- data.frame(g = 1:3e6L, s = 1:3e6L, j = 1:3e6L)

Run some tests:

system.time({ createDF1(columns, d) })
#  user  system elapsed 
# 5.022   1.023   6.054  
system.time({ createDF2(columns, d) })
#  user  system elapsed 
# 0.007   0.004   0.011 
system.time({ createDF3(columns, d) })
#  user  system elapsed 
# 0.105   0.077   0.183

Of these three, it looks like rep(list(rep(NA_integer_, nrow(data))), length(columns)) is the way to go, and replace values from that.

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • `createDF2` is fast because each element of the list points to the same value. The allocation is made only for the first time. The rest is shallow copied. Do: `.Internal(inspect(rep(list(rep(NA, 10)), 5)))` and check the address. Everyone of them is identical. It's not really fast, it's just postponing the time consuming operation (memory allocation) to the next step. It's advantageous if you dint do anything with it (directly write to file). But I suspect it's that often the case. – Arun Aug 04 '14 at 17:27
  • I did a `Rprof` on it and there's not much except for `print` and it's methods. It's fast at least partly because I created the new structure with `NA_integer_` as opposed to `NA` (logical). – Rich Scriven Aug 04 '14 at 18:09
  • Well, maybe..maybe not. :-) – Rich Scriven Aug 04 '14 at 18:14
  • I'm not sure what you're trying to say with `Rprof`. The allocation of NAs happen only once (i.e., a vector of size 3 million integers is allocated, once). Not 21 times. So I'm not surprised with what little `Rprof` showed. And, what exactly do you think makes the code faster by your changing of `NA` to `NA_integer_`? Both of them take 4-bytes for storage. – Arun Aug 04 '14 at 18:22
  • Check the time for: `system.time(rep(NA_integer_, 3e6))` vs `system.time(rep(list(rep(NA_integer_, 3e6)), 300L))`.. It takes 0.007 vs 0.007 on my system.. how can that happen unless they're *just* shallow copied... :) (as R v3.1+ does). – Arun Aug 04 '14 at 18:25
  • @Arun - I'm just going off what Duncan TL tought me in college, Haha. We never really got into memory and storage though, which is a shame since `microbenchmark` seems to be the most-called function in the SO R-tag. :) It's all a learning process. – Rich Scriven Aug 04 '14 at 18:30
2

This is quite easy (in terms of syntax) and efficient (in terms of speed) using the data.table package:

require(data.table) ## 1.9.2+
setDT(d)[, setdiff(columns, names(d)) := NA] ## (1)
setcolorder(d, columns) ## (2)
setDF(d) ## (3)
  1. setDT converts d to a data.table, after which we use the := operator to create new columns by reference. There are many ways to use :=, but highlighted here is the use case LHS := RHS. Here LHS is a vector of column names and RHS is the value. NA is provided only once on the RHS, which gets automatically recycled for all other columns. Note that NA by default is logical type in R.
  2. If required you can reorder the columns of d in the same order as columns using setcolorder.
  3. Again, if necessary, you can convert the data.table back to a data.frame, using the function setDF, which again modifies the object by reference. But it's available in the development version v1.9.3 only for now.
Arun
  • 116,683
  • 26
  • 284
  • 387
0

Setup:

set.seed(1)
DF_all <- setNames(data.frame(matrix(rnorm(5*26), nrow=5, ncol=26)), letters)
DF <- DF_all[, c('f','u','z')]

Create a new empty dataframe and populate with your columns:

DF2 <- setNames(data.frame(matrix(nrow=5, ncol=26)), letters)
DF2[, c('f','u','z')] <- DF[, c('f','u','z')]

Result:

> DF2
   a  b  c  d  e           f  g  h  i  j  k  l  m  n  o  p  q  r  s  t           u  v  w  x  y           z
1 NA NA NA NA NA -0.05612874 NA NA NA NA NA NA NA NA NA NA NA NA NA NA -0.62036668 NA NA NA NA  0.71266631
2 NA NA NA NA NA -0.15579551 NA NA NA NA NA NA NA NA NA NA NA NA NA NA  0.04211587 NA NA NA NA -0.07356440
3 NA NA NA NA NA -1.47075238 NA NA NA NA NA NA NA NA NA NA NA NA NA NA -0.91092165 NA NA NA NA -0.03763417
4 NA NA NA NA NA -0.47815006 NA NA NA NA NA NA NA NA NA NA NA NA NA NA  0.15802877 NA NA NA NA -0.68166048
5 NA NA NA NA NA  0.41794156 NA NA NA NA NA NA NA NA NA NA NA NA NA NA -0.65458464 NA NA NA NA -0.32427027
Thomas
  • 43,637
  • 12
  • 109
  • 140
0

[<- could be used to fill up the missing columns with NA.

`[<-`(d,, setdiff(columns, names(d)), NA)[columns]
#`[<-`(d,, columns[!columns %in% names(d)], NA)[columns] #Alternative
#   a  b  c  d  e  f g  h  i  j  k  l  m  n  o  p  q  r s  t  u
#1 NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA 4 NA NA
#2 NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA NA NA 2 NA NA
#3 NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA NA NA NA 3 NA NA

Or directly adding the missing columns to the original data.frame

d[columns[!columns %in% names(d)]] <- NA
d[columns]
#   a  b  c  d  e  f g  h  i  j  k  l  m  n  o  p  q  r s  t  u
#1 NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA 4 NA NA
#2 NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA NA NA 2 NA NA
#3 NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA NA NA NA 3 NA NA

Or in a function:

f <- function(DF, COL) {
  d[columns[!columns %in% names(d)]] <- NA
  d[columns]
}
f(d, columns)
#   a  b  c  d  e  f g  h  i  j  k  l  m  n  o  p  q  r s  t  u
#1 NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA 4 NA NA
#2 NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA NA NA 2 NA NA
#3 NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA NA NA NA 3 NA NA

Data

d <- data.frame('g'=c(1,2,3), 's' = c(4,2,3))
columns <- letters[1:21]
GKi
  • 37,245
  • 2
  • 26
  • 48