0

In the scenario I have multiple data frames (~100, numbers may vary) but all with the same size. They are basically indicators and I need to take intersection of them all. See the following code:

df1 <- data.frame(col1=c("a","b","c","d"),col2=c(NA,NA,NA,NA),col3=c(NA,"X",NA,"X"),col4=c("X",NA,NA,"X"))
df2 <- data.frame(col1=c("a","b","c","d"),col2=c("X","X",NA,NA),col3=c(NA,NA,NA,"X"),col4=c(NA,NA,NA,NA))
df3 <- data.frame(col1=c("a","b","c","d"),col2=c(NA,NA,"X",NA),col3=c(NA,NA,NA,NA),col4=c(NA,"X",NA,NA))

What I need to create is the output data frame that would contain X if at least one data frame contained X in this cell:

output <- data.frame(col1=c("a","b","c","d"),col2=c("X","X","X",NA),col3=c(NA,"X",NA,"X"),col4=c("X","X",NA,"X"))

I can do this with nested loops but there must be some clever fast way to achieve this result.

scyth
  • 1
  • 3

3 Answers3

0

Something like this?

    dfs <- list(df1, df2, df3)
    index <- lapply(dfs, function(x) apply(x[,2:4], 1, function(y) all(is.na(y))))
    output2 <- list()
    for(i in 1:length(dfs)){
      output2[[i]] <- dfs[[i]][!index[[i]],]
    }
    output <- do.call(rbind, output)
Mislav
  • 1,533
  • 16
  • 37
  • If dfs is the vector containing the names of my data frames, then I get an error when running index line: `Error in `[.default`(x, , 2:4) : incorrect number of dimensions` – scyth May 28 '18 at 20:06
  • dfs is list of dataframes. You can load them with dfs <- lapply(list.files(yourpath), read.csv) – Mislav May 28 '18 at 20:12
0

Say we start with an empty df whose columns we'll fill:

out <- data.frame(col1=c("a","b","c","d"), col2=NA, col3=NA, col4=NA)

Here's one way you could construct a single column with the desired property:

out$col2 <- sapply(1:nrow(out), function(r){
  ifelse(sum(!is.na(c(df1$col2[r], df2$col2[r], df3$col2[r]))) == 0, NA, "X")
})

So abstracting over the column, we could write a function like this:

make_output_column <- function(cname){
  sapply(1:nrow(out), function(r){
    values <- c(df1[[cname]][r], df2[[cname]][r], df3[[cname]][r])
    ifelse(sum(!is.na(values)) == 0, NA, "X")
  })
}

And then apply it across all the columns we want to construct, to create the desired output:

cols <- c("col2", "col3", "col4")
out[, cols] <- lapply(cols, make_output_column)

## col1 col2 col3 col4
##    a    X <NA>    X
##    b    X    X    X
##    c    X <NA> <NA>
##    d <NA>    X    X
lefft
  • 2,065
  • 13
  • 20
  • Thank you. Is there a way to provide a list of data frames instead of listing them explicitly in this line of code: `values <- c(df1[[cname]][r], df2[[cname]][r], df3[[cname]][r])` – scyth May 28 '18 at 20:04
  • certainly! yep, where `df_list <- list(df1, df2, df3)`, you could use `sapply(df_list, function(df) df[[cname]][r])` (maybe set `USE.NAMES=FALSE` in sapply call too) – lefft May 28 '18 at 20:14
  • @scyth See my post as well as the post that I link to in order to get a more extensible method for putting data.frames into a list. – lmo May 28 '18 at 20:16
0

Here is a technique with do.call and pmax

# put your data.frames into a list
myList <- mget(ls(pattern="df\\d"))

See my answer to this post for more details on this line.

cbind(myList[[1]][1], do.call(function(...) pmax(..., na.rm=TRUE),
                              lapply(myList, "==", "X"))[, 2:4])
  col1 col2 col3 col4
1    a    1   NA    1
2    b    1    1    1
3    c    1   NA   NA
4    d   NA    1    1

Here, myList[[1]][1] pulls off the first column from one of the data.frames in the list to return a data.frame with a single column. You could equivalently use df[1]. lapply(myList, "==", "X") runs through the list of data.frames and returns values of TRUE and FALSE for if X is included a cell in the data. Then do.call uses pmax to return the maximum value of each cell for the list of TRUEs and FALSEs and NAs.

If its super important to have "X"s instead of 1s, then follow this with

dat[dat == 1] <- "X"
lmo
  • 37,904
  • 9
  • 56
  • 69