What have I done wrong here? I'm trying to apply the following two lines in a loop over a vector of data frame names:

df[5:length(df)][!is.na(df[5:length(df)])] <- 1
df[5:length(df)][is.na(df[5:length(df)])] <- 0

namelist is a character vector holding the names of 12 data frames:

for(i in namelist){
 i[5:length(i)][!is.na(i[5:length(i)])] <- 1
 i[5:length(i)][is.na(i[5:length(i)])] <- 0
  }

Variables 1:4 in each data frame are to be kept as they are, but I want the remaining columns recoded as binary (NA = 0, anything else = 1). The size of each data frame (observations and variables) can vary.

It does not have to be a fast solution, as this is a small data set.

user2426619
    A list of data frame names is just a character vector; you'd have to use `get`/`mget`/the like to get the objects. That said, this is a bad idiom; [just store a list of data frames from the start](https://stackoverflow.com/a/24376207/4497050) and use `lapply`/the like. – alistaire Dec 26 '17 at 22:07
    you need to set up a list, with each of your dataframes as an element of the list. Then you should use `lapply` as @alistaire mentioned. That's the best way to do this. – Matt W. Dec 26 '17 at 22:12
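
As a rough illustration of what the comments describe: inside the loop, i is only a character string, so indexing it never touches the data frame it names. The sketch below assumes two made-up data frames (df_a, df_b) standing in for the 12 in the question, and uses mget() as one way to collect existing objects into a named list.

```r
# Hypothetical data frames standing in for the ones named in namelist
df_a <- data.frame(id = 1:3, x = c(1, NA, 3))
df_b <- data.frame(id = 1:2, x = c(NA, 5))
namelist <- c("df_a", "df_b")

# Each i is a length-1 character vector, not a data frame,
# so i[5:length(i)] indexes the string rather than the data
loopClasses <- character(0)
for (i in namelist) {
  loopClasses <- c(loopClasses, class(i))
}

# mget() looks up the names and returns a named list of the objects
dfList <- mget(namelist)
listClasses <- vapply(dfList, function(d) class(d)[1], character(1))
```

Once the data frames are in a list, lapply() can process each element without any name lookup.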

1 Answer

Here is an approach that generates a list of data frames containing uniform random numbers and processes it with lapply(), as proposed in the comments on the question. Instead of using is.na() to produce TRUE vs FALSE, this first version uses > 0.5 to create the result data frames, because data frames built from matrices of runif() values won't have missing values.

Note that a single call to is.na() yields TRUE or FALSE for every cell of the selected columns, so no second pass over the data with !is.na() is required.
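
A minimal sketch of that point, using a made-up toy data frame (the names toy, naFlags, present and binary are illustrative only): one is.na() call gives a logical matrix, and its negation or a 0/1 version can be derived from that result without testing the data again.

```r
# Toy data frame with illustrative values (not from the question)
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

naFlags <- is.na(toy)    # logical matrix: TRUE where a value is missing
present <- !naFlags      # negate the result; no second is.na() call needed
binary  <- present * 1L  # integer matrix: 1 = value present, 0 = NA
```

Multiplying by 1L is just one way to turn logicals into integers; the logical matrix can also be used directly, since sum() and colSums() count TRUE as 1.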

Also note that the solution randomly assigns the number of columns in a data frame, so one can see that the solution does not require knowledge of the number of columns in each data frame.

Finally, to illustrate how to process a subset of the columns rather than the entire input data frame, we include logic to bind the first 4 columns of the original data with the columns of logicals.

set.seed(95014123)
dataList <- lapply(1:5,function(x) {
     columnCount <- sample(6:10,1)
     data.frame(matrix(runif(10*columnCount),nrow=10,ncol=columnCount))
})

# recode to binary based on whether values are > 0.5
resultList <- lapply(dataList,function(x) {
     recodedCols <- as.data.frame(x[,5:ncol(x)] > .5)
     colNames <- names(x[,5:ncol(x)])
     names(recodedCols) <- colNames
     cbind(x[,1:4],recodedCols)
})

# count sum of TRUEs across data tables
unlist(lapply(resultList,function(x){
     sum(colSums(x[,5:ncol(x)]))
}))

...and the output:

> unlist(lapply(resultList,function(x){
+      sum(colSums(x[,5:ncol(x)]))
+ }))
[1] 27 20 22 27 17
>

UPDATE: Here is a solution that generates a random percentage of NA values and uses is.na() to create the result data frames.

set.seed(95014123)
dataList <- lapply(1:5,function(x) {
     columnCount <- sample(6:10,1)
     pctMissing <- sample(c(0.1,0.2,0.3,0.4,0.5),1)
     dataValues <- runif(10*columnCount)
     missingIds <- sample(1:(10*columnCount),
                          size=(pctMissing*10*columnCount)) 
     dataValues[missingIds] <- NA
     data.frame(matrix(dataValues,nrow=10,ncol=columnCount))
})

resultList <- lapply(dataList,function(x) {
     recodedCols <- as.data.frame(is.na(x[,5:ncol(x)])) 
     colNames <- names(x[,5:ncol(x)])
     names(recodedCols) <- colNames
     cbind(x[,1:4],recodedCols)
})

# count sum of TRUEs across data tables
unlist(lapply(resultList,function(x){
     sum(colSums(x[,5:ncol(x)]))
}))

...and the output:

> unlist(lapply(resultList,function(x){
+      sum(colSums(x[,5:ncol(x)]))
+ }))
[1] 23 16  9  1 17
> 
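
Note that is.na() marks missing values as TRUE, whereas the question asks for NA = 0 and anything else = 1. Below is a minimal sketch of that coding, assuming a hypothetical helper named recodeMissing and made-up example data; it is one possible variation on the approach above, not the only way to do it.

```r
# recodeMissing is a hypothetical helper: keeps columns before firstCol,
# recodes the rest to 1 (value present) or 0 (NA)
recodeMissing <- function(x, firstCol = 5) {
  if (ncol(x) < firstCol) return(x)   # nothing to recode
  cols <- firstCol:ncol(x)
  # list-style indexing x[cols] stays a data frame even for a single column
  x[cols] <- lapply(x[cols], function(col) as.integer(!is.na(col)))
  x
}

# Made-up example data: four columns to keep, then columns to recode
example <- data.frame(id = 1:3, g1 = "a", g2 = "b", g3 = "c",
                      v1 = c(2.5, NA, 1), v2 = c(NA, NA, 7))
exampleList <- list(first = example, second = example[1:5])

recodedList <- lapply(exampleList, recodeMissing)
```

Because the helper works from ncol(x), it handles data frames with different numbers of columns, including one with only a single column to recode.
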
Len Greski