How to assign values to a new column in R depending on values from other columns

Question

Creating a new column in my data frame of > 900,000 rows to which I want values to be

NA if any of the values in a set of columns are NA

df$newcol[is.na(df$somecol_1) | is.na(df$somecol_2) | is.na(df$somecol_3)] <- NA

0 if all of the values in a set of columns are 0

df$newcol[df$somecol_1==0 & df$somecol_2==0 & df$somecol_3==0] <- 0

1 if any of the values in a set of columns are 1 while none is NA. This is the tricky part as it creates a myriad of combinations with my ten columns. The whole data frame has >50 columns, of which I have ten columns of interest for this procedure and here I present only three:

df$newcol[df$somecol_1==1 & df$somecol_2==0 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==1 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==1 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==1 & df$somecol_3==0] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==1 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==0 & df$somecol_2==0 & df$somecol_3==1] <- 1
df$newcol[df$somecol_1==1 & df$somecol_2==0 & df$somecol_3==1] <- 1

I have a feeling I am overthinking this, there must be a way to make 3 easier? Writing different combinations of columns as shown above would take forever with ten. And a loop would go too slow due to the large dataset.

Dummy data:

df <- NULL
df$somecol_1 <- c(1,0,0,NA,0,1,0,NA,1,1)
df$somecol_2 <- c(NA,1,0,0,0,1,0,NA,0,0)
df$somecol_3 <- c(0,0,0,0,0,0,0,0,0,0)
df <- as.data.frame(df)

Based on the above, I want the new column to be

df$newcol <- c(NA,1,0,NA,0,1,0,NA,1,1)

Your dummy data doesn't work. You never create `df`, and `somecol_3` has 11 values while the others have 10. — Gregor Thomas, Oct 15 '20 at 16:49

akrun · Answer 1 · 2020-10-15T18:09:44.300

2

We can use rowSums

nm1 <- grep('somecol', names(df))
df$newcol <- NA^(rowSums(is.na(df[nm1])) > 0) *(rowSums(df[nm1], na.rm = TRUE) > 0)
df$newcol
#[1] NA  1  0 NA  0  1  0 NA  1  1

data

df <- structure(list(somecol_1 = c(1, 0, 0, NA, 0, 1, 0, NA, 1, 1), 
    somecol_2 = c(NA, 1, 0, 0, 0, 1, 0, NA, 0, 0), somecol_3 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 
-10L))

edited Oct 15 '20 at 18:09

answered Oct 15 '20 at 16:51

akrun

874,273
37
540
662

This would work on the whole data frame, not on a select number of columns which I need. I have clarified that in the question now. – pha Oct 15 '20 at 18:08
@pha Updated the post – akrun Oct 15 '20 at 18:09

Ricardo Semião e Castro · Answer 2 · 2020-10-15T18:38:16.760

df$newcol = ifelse(apply(df,1,sum)>=1,1,0)

Should do the trick. First, apply will sum every row, then: Whenever there is NA values in a row (1st case), any operation will return NA; When there are only 0's (2nd case), the sum is 0 (witch isn't >=1)and the ifelse third argument makes the new entry 0; And when there is at least one 1, the sum is equal or greater than 1 and the ifelse second argument makes the new entry 1.

Edit: As you want to run theese conditions only in some columns - say its columns 1-7, 9, and 23-24 - you can use the code only in that part of the df:

df$newcol = as.numeric(rowSums(df[,c(1:7,9,23:24)])>=1)

OBS: i used the simplified code that Akrun and Gregor have answered.

If you prefer, there are ways to select columns by name: Extracting specific columns from a data frame

Thanks but this would be applicable for the whole data frame. I am interested in a select number of columns in the data frame. I have clarified that in my question now. — pha, Oct 15 '20 at 18:11

score 2 · Answer 3 · answered Oct 15 '20 at 17:00

2

as.numeric(rowSums(df) >= 1) 
#[1] NA  1  0 NA  0  1  0 NA  1  1

rowSums will give NA if there are any missing values. It will be 0 if all the values are 0, it will be 1 otherwise (assuming your data is all either NA, 0, or 1).

(Using akrun's sample data)

answered Oct 15 '20 at 17:00

Gregor Thomas

136,190
20
167
294

1

It is great way. I didn't think about it – akrun Oct 15 '20 at 17:02
I believe `rowSums` applies to the whole row in a data frame. I am interested in using a select number of columns. Have edited that too in my question. – pha Oct 15 '20 at 18:05
@pha You can use `rowSums` (or any other method) on a subset. E.g., `cols_to_sum = c("somecol_1", "somecol_2", "somecol_3")`, then `as.numeric(rowSums(df[cols_to_sum]) >= 1)`. You just need to indicate either the names or indices of the columns you want considered. – Gregor Thomas Oct 16 '20 at 15:49
@pha you can manually enter the column indices, type the column names, construct the column names using `paste()`, use `grep` and regex patterns to extract the column names... whatever is easiest. – Gregor Thomas Oct 16 '20 at 15:56

How to assign values to a new column in R depending on values from other columns

3 Answers3

data