0

I have a dataframe like below. I would like to add 2 columns:

ContainsANZ: Indicates if any of the columns from F0 to F3 contain 'Australia' or 'New Zealand' ignoring NA values

AllANZ: Indicates if all non NA columns contain 'Australia' or 'New Zealand'

Starting dataframe would be:

dfContainsANZ
  Col.A Col.B Col.C            F0            F1            F2        F3
1  data     0   xxx     Australia     Singapore          <NA>      <NA>
2  data     1   yyy United States United States United States      <NA>
3  data     0   zzz     Australia     Australia     Australia Australia
4  data     0   ooo     Hong Kong        London     Australia      <NA>
5  data     1   xxx   New Zealand          <NA>          <NA>      <NA>

The end result should look like this:

df
  Col.A Col.B Col.C            F0            F1            F2        F3 ContainsANZ      AllANZ
1  data     0   xxx     Australia     Singapore          <NA>      <NA>   Australia   undefined
2  data     1   yyy United States United States United States      <NA>   undefined   undefined
3  data     0   zzz     Australia     Australia     Australia Australia   Australia   Australia
4  data     0   ooo     Hong Kong        London     Australia      <NA>   Australia   undefined
5  data     1   xxx   New Zealand          <NA>          <NA>      <NA> New Zealand New Zealand

I'm using dplyr (preferred solution) and have come up with a code like this which doesn't work and is very repetitive. Is there a better way to write this so that I am not having to copy F0|F1|F2... rules over again? My real data set has more. Is NAs interfering with the code?

df <- df %>%
mutate(ANZFlag = 
    ifelse(
    F0 == 'Australia' | 
    F1 == 'Australia' |
    F2 == 'Australia' | 
    F3 == 'Australia',
    'Australia', 
        ifelse(
        F0 == 'New Zealand' | 
        F1 == 'New Zealand' |
        F2 == 'New Zealand' | 
        F3 == 'New Zealand',
        'New Zealand', 'undefined'
        )
    )
)
LoLo
  • 3
  • 1
  • You can do things that are quite competitive speed-wise like `rowSums(df[paste0("F",0:3)] == "Australia", na.rm=TRUE) > 0` to check if certain values are present. – thelatemail Apr 19 '16 at 02:03
  • Thanks, thats really handy. Will give it a go with hrbrmstr's solution. – LoLo Apr 20 '16 at 22:30

1 Answers1

1

Still some typing, but I think this gets at the essence you're looking for:

library(dplyr)

df <- read.table(text='Col.A,Col.B,Col.C,F0,F1,F2,F3
data,0,xxx,Australia,Singapore,NA,NA
data,1,yyy,"United States","United States","United States",NA
data,0,zzz,Australia,Australia,Australia,Australia
data,0,ooo,"Hong Kong",London,Australia,NA
data,1,xxx,"New Zealand",NA,NA,NA', header=TRUE, sep=",", stringsAsFactors=FALSE)

down_under <- function(x) {
  mtch <- c("Australia", "New Zealand")
  cols <- unlist(x)[c("F0", "F1", "F2", "F3")]
  bind_cols(x, data_frame(ContainsANZ=any(mtch %in% cols, na.rm=TRUE),
                          AllANZ=all(as.vector(na.omit(cols)) %in% cols)))
}

rowwise(df) %>% do(down_under(.))

## Source: local data frame [5 x 9]
## Groups: <by row>
## 
##   Col.A Col.B Col.C            F0            F1            F2        F3 ContainsANZ AllANZ
##   (chr) (int) (chr)         (chr)         (chr)         (chr)     (chr)       (lgl)  (lgl)
## 1  data     0   xxx     Australia     Singapore            NA        NA        TRUE   TRUE
## 2  data     1   yyy United States United States United States        NA       FALSE   TRUE
## 3  data     0   zzz     Australia     Australia     Australia Australia        TRUE   TRUE
## 4  data     0   ooo     Hong Kong        London     Australia        NA        TRUE   TRUE
## 5  data     1   xxx   New Zealand            NA            NA        NA        TRUE   TRUE
hrbrmstr
  • 77,368
  • 11
  • 139
  • 205