1

Creating a new column in my data frame of > 900,000 rows to which I want values to be

  1. NA if any of the values in a set of columns are NA
df$newcol[is.na(df$somecol_1) | is.na(df$somecol_2) | is.na(df$somecol_3)] <- NA
  1. 0 if all of the values in a set of columns are 0
df$newcol[df$somecol_1==0 & df$somecol_2==0 & df$somecol_3==0] <- 0
  1. 1 if any of the values in a set of columns are 1 while none is NA. This is the tricky part as it creates a myriad of combinations with my ten columns. The whole data frame has >50 columns, of which I have ten columns of interest for this procedure and here I present only three:
df$newcol[df$somecol_1==1 & df$somecol_2==0 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==1 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==1 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==1 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==1 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==0 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==0 & df$somecol_3==1] <- 1

I have a feeling I am overthinking this, there must be a way to make 3 easier? Writing different combinations of columns as shown above would take forever with ten. And a loop would go too slow due to the large dataset.

Dummy data:

df <- NULL
df$somecol_1 <- c(1,0,0,NA,0,1,0,NA,1,1)
df$somecol_2 <- c(NA,1,0,0,0,1,0,NA,0,0)
df$somecol_3 <- c(0,0,0,0,0,0,0,0,0,0)
df <- as.data.frame(df)

Based on the above, I want the new column to be

df$newcol <- c(NA,1,0,NA,0,1,0,NA,1,1)
pha
  • 316
  • 2
  • 9

3 Answers3

2

We can use rowSums

nm1 <- grep('somecol', names(df))
df$newcol <- NA^(rowSums(is.na(df[nm1])) > 0) *(rowSums(df[nm1], na.rm = TRUE) > 0)
df$newcol
#[1] NA  1  0 NA  0  1  0 NA  1  1

data

df <- structure(list(somecol_1 = c(1, 0, 0, NA, 0, 1, 0, NA, 1, 1), 
    somecol_2 = c(NA, 1, 0, 0, 0, 1, 0, NA, 0, 0), somecol_3 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 
-10L))
akrun
  • 874,273
  • 37
  • 540
  • 662
2
df$newcol = ifelse(apply(df,1,sum)>=1,1,0)

Should do the trick. First, apply will sum every row, then: Whenever there is NA values in a row (1st case), any operation will return NA; When there are only 0's (2nd case), the sum is 0 (witch isn't >=1)and the ifelse third argument makes the new entry 0; And when there is at least one 1, the sum is equal or greater than 1 and the ifelse second argument makes the new entry 1.

Edit: As you want to run theese conditions only in some columns - say its columns 1-7, 9, and 23-24 - you can use the code only in that part of the df:

df$newcol = as.numeric(rowSums(df[,c(1:7,9,23:24)])>=1)

OBS: i used the simplified code that Akrun and Gregor have answered.

If you prefer, there are ways to select columns by name: Extracting specific columns from a data frame

  • Thanks but this would be applicable for the whole data frame. I am interested in a select number of columns in the data frame. I have clarified that in my question now. – pha Oct 15 '20 at 18:11
2
as.numeric(rowSums(df) >= 1) 
#[1] NA  1  0 NA  0  1  0 NA  1  1

rowSums will give NA if there are any missing values. It will be 0 if all the values are 0, it will be 1 otherwise (assuming your data is all either NA, 0, or 1).

(Using akrun's sample data)

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • 1
    It is great way. I didn't think about it – akrun Oct 15 '20 at 17:02
  • I believe `rowSums` applies to the whole row in a data frame. I am interested in using a select number of columns. Have edited that too in my question. – pha Oct 15 '20 at 18:05
  • @pha You can use `rowSums` (or any other method) on a subset. E.g., `cols_to_sum = c("somecol_1", "somecol_2", "somecol_3")`, then `as.numeric(rowSums(df[cols_to_sum]) >= 1)`. You just need to indicate either the names or indices of the columns you want considered. – Gregor Thomas Oct 16 '20 at 15:49
  • @pha you can manually enter the column indices, type the column names, construct the column names using `paste()`, use `grep` and regex patterns to extract the column names... whatever is easiest. – Gregor Thomas Oct 16 '20 at 15:56