R - adding a new column based on binary data across many columns

Question

I cannot get my data frame to accept an additional column. I have reviewed many Stack Overflow questions; here is a subset: Adding a new column in a matrix in R; Adding new column to data frame in R; New column not added to dataframe in R; R: complete a dataset with a new column added; R: add a new column to dataframes from a function.

I need a single column that tells us, for each row, whether there is a positive ("1") in any of the viral PCR columns I have.

I am trying to determine probability and from what I see, I will need this column to do further calculations, so please help if able!

Sample data

Filovirus (MOD) PCR   :    Phlebo (Sanchez-Seco) PCR
0                          0         
0                          1            
0                          0            
0                          0        
0                          0         
0                          0        
0                          0       
0                          0         
0                          0        
0                          0   


species code  forest site
<fctr>  <dbl> <fctr>
SM      1     UMNP-mangabey
SM      1     UMNP-mangabey
RC      9     UMNP-hondohondoc
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod

The closest I have gotten is getting base R to call which rows have the positive value

I followed the solution here but have yet to get it to work for me.

tmp=which(data==1,arr.ind=T)    
tmp=tmp[order(tmp[,"row"]),]
c("positive","negative")[tmp[,"col"]] -> data$new

Any advice is greatly appreciated.

Dput

structure(list(`Filovirus (MOD) PCR` = c("0", "0", "0", "0", 
"0", "0", "0", "0", "0", "0"), `Filovirus (A) PCR` = c("0", "0", 
"0", "0", "0", "0", "0", "0", "0", "0"), `Filovirus (B) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Filo C PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Filovirus (D) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Coronavirus   (Quan) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Coronavirus (Watanabe) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Paramyxo  (Tong)  PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Flavivirus Moureau PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Flavivirus  Sanchez-seco PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Arena Lozano 1 PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Retrovirus Courgnard PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Simian Foamy Goldberg (Pol) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Simian Foamy Goldberg (LTR Region) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Influenza (Anthony) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Influenza (Liang) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Rhabdo (CII) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Enterovirus CII I PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Enterovirus CII-II PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Alphav   (Sanchez-Seco) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Lyssavirus (Vasquez-Moron) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Seadornavirus (CII) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Hantavirus (Raboni) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Hantavirus (Klempa) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Nipah (Wacharapleusadee) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Henipa (Feldman) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Bunya S (Briese) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Bunya L (Briese) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Phlebo (Sanchez-Seco) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), species = structure(c(3L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("SM", "SY", "BWC", 
"YB", "RC"), class = "factor"), code = c(2, 5, 5, 5, 5, 5, 5, 
5, 5, 5), forestsite = structure(c(3L, 14L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L), .Label = c("Magombera1", "Magombera2", "NDUFR", 
"Ndundulu1", "Ndundulu2", "Ndundulu3", "Nyumbanitu", "UMNP-campsite3", 
"UMNP-hondohondoa", "UMNP-hondohondob", "UMNP-hondohondoc", "UMNP-hondohondod", 
"UMNP-hondohondoe", "UMNP-HQ", "MamaGoti", "UMNP-mangabey", "UMNP-njokamoni", 
"UMNP-Sanje1", "UMNP-Sanje2", "UMNP-Sanje3", "Sonjo", "SonjoRoad"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Please provide sample out by using `dput(head(x))`, it's much easier for us to use and test on and removes ambiguity. It also helps if you include the literal output you're expecting in that output. Thanks! — r2evans, Mar 04 '23 at 13:08

langtang · Answer 1 · 2023-03-04T14:37:23.117

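Update, given character columns and the new 32-column example (in the posted dput, columns 30:32 are species, code and forestsite, so those are the ones to exclude):

df["new"] = apply(df[, -(30:32)], 1, \(x) ifelse(sum(as.numeric(x)) > 0, "positive", "negative"))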
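Selecting by position depends on the column order staying fixed. A minimal base-R sketch of the same idea that picks the assay columns by name instead, assuming every assay column name contains "PCR" (true for the dput in the question). The toy data frame `toy` and the helper `flag_any_positive` are hypothetical names used only for illustration:

```r
# Hypothetical toy data mirroring the question: "0"/"1" stored as character,
# followed by a metadata column.
toy <- data.frame(
  `Filovirus (MOD) PCR`       = c("0", "0", "0"),
  `Phlebo (Sanchez-Seco) PCR` = c("0", "1", "0"),
  species                     = c("SM", "SM", "RC"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: label each row "positive" if any column whose name
# contains `pattern` holds a 1, otherwise "negative".
flag_any_positive <- function(df, pattern = "PCR") {
  pcr_cols <- grep(pattern, names(df), fixed = TRUE, value = TRUE)
  # as.matrix() keeps the strings; comparing with "1" gives a logical matrix,
  # and rowSums() then counts the positives in each row.
  hits <- rowSums(as.matrix(df[pcr_cols]) == "1")
  ifelse(hits > 0, "positive", "negative")
}

toy$new <- flag_any_positive(toy)
toy$new
```

Because the comparison is done on strings, this variant does not need the columns converted to numbers first.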

Original answer (assuming numeric columns):

You can simply do this:

df["new"] = ifelse(rowSums(df[, -(1:3)]) > 0, "positive", "negative")

Output:

   species code      forest_site Filovirus (MOD) PCR Phlebo (Sanchez-Seco) PCR      new
1       SM    1    UMNP-mangabey                   0                         0 negative
2       SM    1    UMNP-mangabey                   0                         1 positive
3       RC    9 UMNP-hondohondoc                   0                         0 negative
4      BWC    9 UMNP-hondohondod                   0                         0 negative
5      BWC    9 UMNP-hondohondod                   0                         0 negative
6      BWC    9 UMNP-hondohondod                   0                         0 negative
7      BWC    9 UMNP-hondohondod                   0                         0 negative
8      BWC    9 UMNP-hondohondod                   0                         0 negative
9      BWC    9 UMNP-hondohondod                   0                         0 negative
10     BWC    9 UMNP-hondohondod                   0                         0 negative

Input:

structure(list(species = c("SM", "SM", "RC", "BWC", "BWC", "BWC", 
"BWC", "BWC", "BWC", "BWC"), code = c(1L, 1L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L), forest_site = c("UMNP-mangabey", "UMNP-mangabey", 
"UMNP-hondohondoc", "UMNP-hondohondod", "UMNP-hondohondod", "UMNP-hondohondod", 
"UMNP-hondohondod", "UMNP-hondohondod", "UMNP-hondohondod", "UMNP-hondohondod"
), `Filovirus (MOD) PCR` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `Phlebo (Sanchez-Seco) PCR` = c(0, 
1, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 
-10L))

This works for your input, and I profusely apologize for not including my dput initially. I keep getting the error `Error in rowSums(maybed[, -(29:32)]) : 'x' must be numeric`, even after I convert my entire data frame to a matrix. — Marnee Roundtree, Mar 04 '23 at 14:01

TarJae · Accepted Answer · 2023-03-04T14:06:18.930

Update: your 0s and 1s are stored as character strings. Converting them to numbers with type.convert(as.is = TRUE) makes the code work:

library(dplyr)

df %>%
  type.convert(as.is=TRUE) %>% 
  mutate(new_column = if_else(rowSums(select(., contains("PCR"))) > 0, "positive", "negative"))

   Filovirus (…¹ Filov…² Filov…³ Filo …⁴ Filov…⁵ Coron…⁶ Coron…⁷ Param…⁸ Flavi…⁹ Flavi…˟ Arena…˟ Retro…˟ Simia…˟ Simia…˟ Influ…˟
           <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
 1             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 2             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 3             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 4             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 5             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 6             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 7             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 8             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 9             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
10             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
# … with 18 more variables: `Influenza (Liang) PCR` <int>, `Rhabdo (CII) PCR` <int>, `Enterovirus CII I PCR` <int>,
#   `Enterovirus CII-II PCR` <int>, `Alphav   (Sanchez-Seco) PCR` <int>, `Lyssavirus (Vasquez-Moron) PCR` <int>,
#   `Seadornavirus (CII) PCR` <int>, `Hantavirus (Raboni) PCR` <int>, `Hantavirus (Klempa) PCR` <int>,
#   `Nipah (Wacharapleusadee) PCR` <int>, `Henipa (Feldman) PCR` <int>, `Bunya S (Briese) PCR` <int>,
#   `Bunya L (Briese) PCR` <int>, `Phlebo (Sanchez-Seco) PCR` <int>, species <chr>, code <int>, forestsite <chr>,
#   new_column <chr>, and abbreviated variable names ¹`Filovirus (MOD) PCR`, ²`Filovirus (A) PCR`, ³`Filovirus (B) PCR`,
#   ⁴`Filo C PCR`, ⁵`Filovirus (D) PCR`, ⁶`Coronavirus   (Quan) PCR`, ⁷`Coronavirus (Watanabe) PCR`, …
# ℹ Use `colnames()` to see all variable names
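The `'x' must be numeric` error reported in the comments follows from the column types, and converting the data frame to a matrix does not change them. A small base-R sketch, using a hypothetical two-column data frame `chr_df`, shows the failure and the effect of `type.convert(as.is = TRUE)`:

```r
# Hypothetical data: 0/1 results stored as character strings, as in the dput.
chr_df <- data.frame(a = c("0", "1"), b = c("0", "0"), stringsAsFactors = FALSE)

# rowSums() refuses non-numeric input; as.matrix() does not help because a
# data frame of character columns becomes a character matrix.
err <- tryCatch(rowSums(as.matrix(chr_df)), error = function(e) conditionMessage(e))
err

# type.convert() re-parses each column; as.is = TRUE keeps any remaining
# text columns as character instead of turning them into factors.
num_df <- type.convert(chr_df, as.is = TRUE)
sapply(num_df, class)
rowSums(num_df)
```

After the conversion both columns are integer, so `rowSums()` accepts them.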

First answer: the dplyr equivalent would be the following (data taken from @langtang, many thanks):

library(dplyr)

df %>%
  mutate(new_column = if_else(rowSums(select(., contains("PCR"))) > 0, "positive", "negative"))

   species code      forest_site Filovirus (MOD) PCR Phlebo (Sanchez-Seco) PCR new_column
1       SM    1    UMNP-mangabey                   0                         0   negative
2       SM    1    UMNP-mangabey                   0                         1   positive
3       RC    9 UMNP-hondohondoc                   0                         0   negative
4      BWC    9 UMNP-hondohondod                   0                         0   negative
5      BWC    9 UMNP-hondohondod                   0                         0   negative
6      BWC    9 UMNP-hondohondod                   0                         0   negative
7      BWC    9 UMNP-hondohondod                   0                         0   negative
8      BWC    9 UMNP-hondohondod                   0                         0   negative
9      BWC    9 UMNP-hondohondod                   0                         0   negative
10     BWC    9 UMNP-hondohondod                   0                         0   negative

I've tried this as well, and I get the same error code when I use my dput data. I apologize for not having that in the original post! — Marnee Roundtree, Mar 04 '23 at 14:03
The update works, but then my new column just reads instead of my value. — Marnee Roundtree, Mar 04 '23 at 14:28
`df %>% type.convert(as.is=TRUE) %>% mutate(new_column = if_else(rowSums(select(., contains("PCR"))) > 0, "positive", "negative")) %>% select(new_column)` will give `new_column 1 negative 2 negative 3 negative 4 negative 5 negative 6 negative 7 negative 8 negative 9 negative 10 negative` — TarJae, Mar 04 '23 at 14:30
Yes, it does! So, now I need to use that column in a function for determining probability. I was going to do a chi-square test, but of course that's not working. Do I need to start a new thread if I need help there? — Marnee Roundtree, Mar 04 '23 at 14:48

score 1 · Answer 3 · answered Mar 04 '23 at 15:22

Another option is if_any

library(dplyr)
df1 %>%
 type.convert(as.is = TRUE) %>%
 mutate(new_column = c("negative", "positive")[if_any(contains("PCR")) + 1])

Output:

   species code      forest_site Filovirus (MOD) PCR Phlebo (Sanchez-Seco) PCR new_column
1       SM    1    UMNP-mangabey                   0                         0   negative
2       SM    1    UMNP-mangabey                   0                         1   positive
3       RC    9 UMNP-hondohondoc                   0                         0   negative
4      BWC    9 UMNP-hondohondod                   0                         0   negative
5      BWC    9 UMNP-hondohondod                   0                         0   negative
6      BWC    9 UMNP-hondohondod                   0                         0   negative
7      BWC    9 UMNP-hondohondod                   0                         0   negative
8      BWC    9 UMNP-hondohondod                   0                         0   negative
9      BWC    9 UMNP-hondohondod                   0                         0   negative
10     BWC    9 UMNP-hondohondod                   0                         0   negative
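For the probability calculations mentioned in the question, a logical flag can be handier than a text label. A base-R sketch of the same row-wise "any" test, assuming numeric 0/1 results; `toy2` and `any_pos` are hypothetical names used only for illustration:

```r
# Hypothetical toy data with numeric 0/1 assay results.
toy2 <- data.frame(
  `Filovirus (MOD) PCR`       = c(0, 0, 0, 0),
  `Phlebo (Sanchez-Seco) PCR` = c(0, 1, 0, 0),
  species                     = c("SM", "SM", "RC", "BWC"),
  check.names = FALSE
)

# Test each assay column against 1, then OR the per-column results together
# so a row is TRUE when at least one assay is positive.
pcr_cols <- grep("PCR", names(toy2), fixed = TRUE, value = TRUE)
any_pos  <- Reduce(`|`, lapply(toy2[pcr_cols], function(x) x == 1))

toy2$new_column <- ifelse(any_pos, "positive", "negative")

# The mean of a logical vector is the share of TRUE values, i.e. the
# proportion of samples with at least one positive assay.
mean(any_pos)
```

Keeping `any_pos` as a logical vector alongside the text label means later steps can count or average it directly.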
