2

I have multiple response data that has been split into separate columns with cSplit_e into a format like this...

  ID Response IM2   IM4 ...   IM10  IM16
1  1   4,7,10  NA     1          1    NA
2  2 7,5,16,8  NA    NA         NA     1
3  3     2,10   1    NA          1    NA

I'm trying to set up a function will check each row to see if a subset of columns contains at least one "1". It will then create a new column, setting it to "1" if a row had at least one "1" in the specified columns.

Previously I have done this by writing out a for loop for each column I want to create, like so...

parade$q9PaperAggregate <- NA
parade$q9MagazineAggregate <- NA

#Newspaper Aggregate Loop
for (i in 1:nrow(parade)) { #Starts loop setting i to each row number
    if (is.na(parade$q9PaperAds[i]) == FALSE | ##These three lines check each row is not all NA
        is.na(parade$q9PaperCircs[i]) == FALSE |
        is.na(parade$q9PaperWebAds[i]) == FALSE) {
            parade$q9PaperAggregate[i] <- 1 #Sets agg cell value to 1 if not all NA for each i
    }
}
#Magazine Aggregate Loop
for (i in 1:nrow(parade)) {
    if (is.na(parade$q9MagazineAds[i]) == FALSE |
        is.na(parade$q9MagazineWebAds[i]) == FALSE) {
            parade$q9MagazineAggregate[i] <- 1
    }
}

This works, but is clearly inefficient. I want to create a general function that does this for inputs. Here is what I have so far:

#df = object; n = new column name; col = vector of columns I want to check
atleastone <- function(df, n, col) {
#n = new column name (will run over list of vector - new col names with the old columns you want to agg)
    df[n] <- NA
    for (i in 1:nrow(df)) { #Starts loop setting i to each row number
    if (df[i, col] == 1) {
            (df[n])[i] <- 1 #Sets new column cell value to 1 if not all NA for each i
            }
    }
}

My main two issues are 1) how to run the for loop to check multiple columns for the value if the number of columns to be checked can vary and 2) how to pass the row and column to subset. Currently "col" uses the actual name of the column while "i" just takes the numerical row value. This was fine in the format I used before of...

df$column[i]

...but the $ operator doesn't seem to work with values being passed to it from a function.

Any idea what I'm doing wrong here? Is there a better way to do this?

Thank you for your time.

EDIT:

I turned @SymbolixAU's response into a function, like so:

#Aggregate Function
#takes input df = object; n = name of new column in double quotes; l = columns you want to agg
agger <- function(df, l, n) {
    #checks if the sum of the rows in the specified columns is greater than 1
    #this produces a logical value which is multiplied by 1 to change it to numeric
    df[n] <- ((rowSums(df[, l] == 1) > 0) * 1)
}

Follow-up question - I am trying to use mapply to pass a list "x" of two different vectors of columns to the argument "l" and a vector "y" of two names for the two new columns that will be created and the target object df = BR. The command looks like this:

mapply(agger, l = x, n = y, MoreArgs = list(BR))

This is sending me to the debug window with no messages or info on what is going wrong. Is my mapply set up incorrectly and/or is there a better way to run this function on multiple groups of columns in the same dataframe?

Thank you.

Donovan192
  • 658
  • 1
  • 8
  • 13
  • 1
    Regarding why `df$column[i]` doesn't work, see [this R-FAQ](http://stackoverflow.com/q/18222286/903061). tl;dr use `df[column[i]]` or `df[[column[i]]]` when using column names stored as strings, `$` won't work with variable column names. – Gregor Thomas Nov 04 '16 at 02:32
  • 1
    If you've already accepted an answer for your original question, you're unlikely to get a response if you then update your question. You're better off asking a separate question. – SymbolixAU Nov 04 '16 at 23:38

3 Answers3

6

This can be done in one line without any loops or *applys:

df$new_col <- ((rowSums(df[, col] == 1, na.rm=T) > 0) * 1)

Explanation

  1. You can check if a value in a subset of columns is equal to 1 using df[, col] == 1

  2. Then you can check how many of those values there are in each row using rowSums()

  3. Then if there are any rowSums results with a value greater than 0, you know there was at least one 1 in one of the columns in that row

  4. the > 0 check returns a logical, so multiply it by 1 to turn it into numeric


Example

## taking the data you've provide
df <- read.table(text = "ID Response IM2   IM4  IM10  IM16
1   4,7,10  NA     1          1    NA
2 7,5,16,8  NA    NA         NA     1
3     2,10   1    NA          1    NA", header = T)


df
#   ID Response IM2 IM4 IM10 IM16
# 1  1   4,7,10  NA   1    1   NA
# 2  2 7,5,16,8  NA  NA   NA    1
# 3  3     2,10   1  NA    1   NA

## specify the columns of interest
col <- c("IM4", "IM10")

## assign the new column
df$new_col <- ((rowSums(df[, col] == 1, na.rm=T) > 0) * 1)
df
#    ID Response IM2 IM4 IM10 IM16 new_col
# 1  1   4,7,10  NA   1    1   NA       1
# 2  2 7,5,16,8  NA  NA   NA    1       0
# 3  3     2,10   1  NA    1   NA       1

And this also works on the data provided by @Barker

set.seed(100)
df <- data.frame(ID = 1:20)
df[paste0("IM", 1:7)] <- replicate(7,sample(c(rep(NA,5),1,1),20, replace = TRUE))

col <- paste0("IM", 1:7)
df$new_col <- ((rowSums(df[, col] == 1, na.rm=T) > 0) * 1)

Update

I'm not entirely sure what you're asking in your comment, but you can do something like

## using @Barker 's data gain, create a list of groups of columns
col_groups <- list(grp1 = c("IM1","IM2"),
                   grp2 = c("IM2","IM7"),
                   grp3 = c("IM5","IM7"))

## use lapply to do the calculation for each group of columns
df2 <- lapply(col_groups, function(x){
  df['new_col'] <- ((rowSums(df[, x] == 1, na.rm=T) > 0) * 1)
  return(df)
})

This will obviously return a list of data.frames, each one named according to the grp value, and each having a different new_col appended.

SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
  • Excellent, that is exactly what I was looking for! One small note for anyone else looking at this (and related to part of my original question) - to use this in a function you need to use `df[new_col] <- ` instead of `df$new_col <- ` because you can't pass strings to the `$` operator. – Donovan192 Nov 04 '16 at 15:28
  • Do you have any input on how I can run this (as turned into a function) on multiple groups of columns? Thanks – Donovan192 Nov 04 '16 at 17:45
  • @MaxHeadroom - I've updated my answer. I'm not entirely sure what you're asking for with respect to 'multiple groups of columns', or what you expect the output to be, but it should give you an idea. – SymbolixAU Nov 04 '16 at 23:37
1

This can actually be done pretty simply with an apply function.

Lets start by making an example data set to work with:

set.seed(100)
df <- data.frame(ID = 1:20)
df[paste0("IM", 1:7)] <- replicate(7,sample(c(rep(NA,5),1,1),20, replace = TRUE))

Now that we have the data, we can actually do the task in a single line:

df[["newName"]] <- apply(df[grep("^IM", names(df))] == 1, 1, any, na.rm = TRUE)

Let's go over what is happening here. First, we are using df[grep("^IM", names(df))] to subset the columns in df that start with IM. Next, we use == to determine which values equal 1. The apply function takes that new logical data frame and for each row (since the second parameter equals 1, columnes would be 2, each element would be c(1, 2)) it uses the any function. Any returns TRUE if at least one of the inputs are TRUE, however, if none are TRUE and one of the values is NA it will return NA since it doesn't know if NA counts as TRUE or FALSE. By adding na.rm = TRUE at the end, we tell the any function to ignore NA values. If you would rather have NA instead of FALSE for these values, simply remove the na.rm = TRUE statement.

Barker
  • 2,074
  • 2
  • 17
  • 31
1

This code

parade$q9PaperAggregate <- NA

#Newspaper Aggregate Loop
for (i in 1:nrow(parade)) { #Starts loop setting i to each row number
    if (is.na(parade$q9PaperAds[i]) == FALSE | ##These three lines check each row is not all NA
        is.na(parade$q9PaperCircs[i]) == FALSE |
        is.na(parade$q9PaperWebAds[i]) == FALSE) {
            parade$q9PaperAggregate[i] <- 1 #Sets agg cell value to 1 if not all NA for each i
    }
}

can be vectorized (to run faster) as :

parade$q9PaperAggregate <- ifelse(is.na(parade$q9PaperAds) &
                                  is.na(parade$q9PaperCircs) &
                                  is.na(parade$q9PaperWebAds), NA, 1)
HubertL
  • 19,246
  • 3
  • 32
  • 51