
I want to write a function that takes a data frame with a varying number of columns and creates two new columns:

  1. one based on a logical comparison of all others and
  2. one based on a logical comparison of all others and the first new column.

A minimal example would be a data set with two variables:

V1 <- c(1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0)
V2 <- c(0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0)
Data <- data.frame(V1, V2)

I want to create the two new columns with a function looking like this:

my.spec.df <- function(data, variables, new.var.name){
   new.df <- data

   # First new column
   new.df[[new.var.name]] <- 0
   new.df[[new.var.name]][new.df$V1 == Lag(new.df$V1, 1) & new.df$V2 == Lag(new.df$V2, 1)] <- 1 # I want my logical comparison to be applicable to all variables listed in [[variables]], not just V1 and V2 used here as minimal example

   # Second new column
   new.df$Conj.Var.[[new.var.name]] <- 0 # I want this second new column to take the name "Conj.Var."+the name of the first new variable, which I tried to achieve with the [[]] but it did not work (same in the next row)
   new.df$Conj.Var.[[new.var.name]][new.df$V1 == 1 & new.df$V2 == 1 & new.df[[new.var.name]] == 1] <- 1 # Again, I want the logical comparison to be applicable to all variables listed [[variables]] and the first newly created column

   return(new.df)
}

spec.df <- my.spec.df(Data,
                      variables=c("V1", "V2"),
                      new.var.name="NV1")

The new data frame should look like:

print(spec.df)
   V1 V2 NV1 Conj.Var.NV1
1   1  0   0            0
2   0  1   0            0
3   1  1   0            0
4   1  1   1            1
5   0  0   0            0
6   0  1   0            0
7   1  0   0            0
8   1  0   1            0
9   0  0   0            0
10  0  1   0            0
11  0  1   1            0
12  1  1   0            0
13  1  0   0            0
14  0  1   0            0
15  0  0   0            0

As commented in the code, I struggle with three things:

  1. apply the logical comparisons for the first new column to all variables listed (not just the two as in my minimal example) because the number could go from one variable listed to multiple ones,
  2. format the name of the second new column based on the name introduced for the first and
  3. apply the logical comparison for the second new column also to all variables listed.

Anyone that could help? Many thanks in advance!

Rui Barradas
CNiessen

2 Answers


Here is a solution.
It uses an auxiliary function, all_one_by_row, to do the main work, and a temporary logical matrix, tmp, that records where each column listed in variables equals its lagged value.

all_one_by_row <- function(data, cols) {
  if(missing(cols))
    as.integer(rowSums(data) == ncol(data))
  else
    as.integer(rowSums(data[cols]) == ncol(data[cols]))
}

my.spec.df <- function(data, variables, new.var.name){
  new.df <- data

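  # First new column
  # Lag() is not base R; it is assumed here to be Hmisc::Lag(), so load
  # library(Hmisc) first. The \(x) lambda shorthand needs R >= 4.1.
  tmp <- sapply(new.df[variables], \(x) x == Lag(x, 1))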
  tmp[is.na(tmp)] <- FALSE
  new.df[[new.var.name]] <- all_one_by_row(tmp)
  
  # Second new column
  New.Col <- paste0("Conj.Var.", new.var.name)
  Cols <- c(variables, new.var.name)
  new.df[[New.Col]] <- all_one_by_row(new.df, Cols)

  new.df
}

spec.df <- my.spec.df(Data,
                      variables=c("V1", "V2"),
                      new.var.name="NV1")

spec.df
#   V1 V2 NV1 Conj.Var.NV1
#1   1  0   0            0
#2   0  1   0            0
#3   1  1   0            0
#4   1  1   1            1
#5   0  0   0            0
#6   0  1   0            0
#7   1  0   0            0
#8   1  0   1            0
#9   0  0   0            0
#10  0  1   0            0
#11  0  1   1            0
#12  1  1   0            0
#13  1  0   0            0
#14  0  1   0            0
#15  0  0   0            0
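
For readers who do not have a Lag() function available, the same idea can be sketched in base R only. This is a minimal sketch, not the code above: lag_one and my_spec_df_base are hypothetical names introduced for illustration, and the sketch treats a comparison with a missing previous value (the first row) as FALSE.

```r
# Hypothetical helper: shift a vector down by one position, padding with NA.
lag_one <- function(x) c(NA, head(x, -1))

my_spec_df_base <- function(data, variables, new.var.name) {
  new.df <- data

  # First new column: 1 where every listed variable equals its previous value.
  # `%in% TRUE` turns the NA produced by the first row into FALSE.
  same.as.previous <- lapply(new.df[variables],
                             function(x) (x == lag_one(x)) %in% TRUE)
  new.df[[new.var.name]] <- as.integer(Reduce(`&`, same.as.previous))

  # Second new column: 1 where every listed variable is 1 and the first
  # new column is 1 as well. The name is built with paste0().
  all.one <- Reduce(`&`, lapply(new.df[variables], function(x) x == 1))
  conj.name <- paste0("Conj.Var.", new.var.name)
  new.df[[conj.name]] <- as.integer(all.one & new.df[[new.var.name]] == 1)

  new.df
}

V1 <- c(1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0)
V2 <- c(0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0)
res <- my_spec_df_base(data.frame(V1, V2),
                       variables = c("V1", "V2"),
                       new.var.name = "NV1")
```

Reduce() with `&` combines any number of logical vectors, so the same function also works when variables names a single column.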
Rui Barradas

Note that the sample data is not enough to test cases with more than two variables, so I added a third variable, V3. I also inserted a 1 at position 13 of V2.

V1 <- c(1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0)
V2 <- c(0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0)
V3 <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0)
Data <- data.frame(V1, V2, V3)
library(tidyverse)
library(rlang)
my.spec.df <- function(data, variables, new.var.name){
  x <- sym(variables[1])
  y <- syms(variables[-1]) # list of variable names
  ind <- ncol(data) + (2+length(y))
  lgl <- parse_expr(paste0(lapply(y, \(x){
    paste(x, "== lag(", x, ")")
  }), collapse = " & "))
  lgl2 <- parse_expr(paste0(lapply(y, paste, "== 1"), collapse = " & "))
  comps <- expr(!!x == x_lag & !!lgl)
  comps2 <- expr(!!x == 1 & !!lgl2 & .[[ind]] == 1)
  data %>%
    mutate(x_lag = lag(!!x, 1, default = 0)) %>%
    # funs() has been deprecated since dplyr 0.8.0; a named list of lambdas replaces it
    mutate_at(vars(!!!y), list(lag = ~ lag(., default = 0))) %>%
    mutate("{new.var.name}" := ifelse(!!comps, 1, 0)) %>%
    mutate("Conj.var.{new.var.name}" := ifelse(!!comps2, 1, 0)) %>%
  select(-ends_with("lag"))
}

For dplyr 1.0 and later, we can use syntax from the glue package to name new variables through :=. See this post for other methods. Because we don't know the number of variables in advance, we need to refer to the new column dynamically. This Stack Overflow post lists various methods to do that.
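
As a small illustration of the naming syntax described here, the sketch below creates one column whose name comes from a string and a second column whose name is built from that string. It assumes dplyr 1.0 or later; name_demo and the toy data are hypothetical, and .data[[...]] is shown as one possible way to refer to the new column by name, not necessarily the method used in the function above.

```r
library(dplyr)

# Hypothetical helper, for illustration only.
name_demo <- function(data, new.var.name) {
  data %>%
    # "{...}" on the left of := is filled in from the function argument
    mutate("{new.var.name}" := as.integer(V1 == 1)) %>%
    # .data[[...]] looks the new column up by its name held in a string
    mutate("Conj.Var.{new.var.name}" :=
             as.integer(.data[[new.var.name]] == 1 & V2 == 1))
}

toy <- data.frame(V1 = c(1, 0, 1), V2 = c(1, 1, 0))
out <- name_demo(toy, "NV1")
```

Referring to the column by name avoids having to compute its position, which changes with the number of helper columns added along the way.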

When tested on the sample data, my.spec.df(Data, variables = c("V1", "V2"), new.var.name = "NV1") returns

   V1 V2 V3 NV1 Conj.var.NV1
1   1  0  0   0            0
2   0  1  0   0            0
3   1  1  0   0            0
4   1  1  1   1            1
5   0  0  1   0            0
6   0  1  1   0            0
7   1  0  0   0            0
8   1  0  0   1            0
9   0  0  0   0            0
10  0  1  0   0            0
11  0  1  1   1            0
12  1  1  1   0            0
13  1  1  1   1            1
14  0  1  0   0            0
15  0  0  0   0            0

and my.spec.df(Data, variables = c("V1", "V2", "V3"), new.var.name = "NV1") returns

   V1 V2 V3 NV1 Conj.var.NV1
1   1  0  0   0            0
2   0  1  0   0            0
3   1  1  0   0            0
4   1  1  1   0            0
5   0  0  1   0            0
6   0  1  1   0            0
7   1  0  0   0            0
8   1  0  0   1            0
9   0  0  0   0            0
10  0  1  0   0            0
11  0  1  1   0            0
12  1  1  1   0            0
13  1  1  1   1            1
14  0  1  0   0            0
15  0  0  0   0            0
Donald Seinen
  • Many thanks again. When I run the code, exactly as posted by you, I get `Error in !y : invalid argument type`. Any clue why? – CNiessen Oct 25 '21 at 17:50
  • Did you run the function line by line, or call it via `my.spec.df(Data, variables = c("V1", "V2"), new.var.name = "NV1")`? That error occurs when `!!` or `!!!` is taken out of its intended context (which is within `tidyverse` functions). See https://stackoverflow.com/questions/53093630/r-bare-to-quosure-in-function-invalid-argument-type. Another possibility: when you run `library(tidyverse, quietly = FALSE)` in a fresh session, check that `x dplyr::lag() masks stats::lag()` is listed under conflicts. – Donald Seinen Oct 25 '21 at 18:02
  • I called it via `my.spec.df(Data, variables = c("V1", "V2"), new.var.name = "NV1")`. What do you mean by "Check when (...) under conflicts"? – CNiessen Oct 25 '21 at 18:36
  • Let's try and run `version[c("major", "minor")]`. Check if it is at least R 3.5 or higher, earlier versions have some issues with `!!`, which is a special operator defined by the `rlang` package. If that is okay, check `packageVersion("rlang")`. Mine is 0.4.11. I was unable to reproduce your error using this version. If all that checks out, try the previous again: when `library(tidyverse)` is run for the first time when you open RStudio it will print a lot of information in the console. One line states a warning -- `dplyr::lag overwrites stats::lag`. – Donald Seinen Oct 25 '21 at 18:54
  • I updated R and it is now 4.1.1, with the error persisting. My `rlang` version is 0.4.12. When calling `library(tidyverse)`, the warnings I get are `dplyr::filter() masks stats::filter()` and `dplyr::lag() masks stats::lag()`. – CNiessen Oct 26 '21 at 07:52
  • Could you run `rlang::last_trace()` and screenshot it? The only way I was able to reproduce the error was by adding a 4th `!` to the `mutate_at` step. Make sure there are 3 `!` there, i.e. `mutate_at(vars(!!!y), ... )` – Donald Seinen Oct 26 '21 at 08:26
  • Or rather, `traceback()` – Donald Seinen Oct 26 '21 at 08:34
  • I copied the code exactly as you put it above. I ran the two tests you requested. (1) `rlang::last_trace()` returns: `Can't show last error because no error was recorded yet -- Backtrace: x -- 1. \-rlang::last_trace() -- 2. \-rlang::last_error()`. [I first performed the operation, got the same error message and it still tells me there was no error recorded] (2) `traceback()` returns 13 lines that are too long to be posted here. But I can send them to you by e-mail if you want. – CNiessen Oct 27 '21 at 09:14
  • That being said, I do not want to steal your time with fixing the bug. I report it here so you see what happened on my end. But I totally understand if you do not have the time to look into this further (the other solution helped me out already). I highly appreciate already all the generous help you offered. – CNiessen Oct 27 '21 at 09:14
  • @CNiessen Don't worry about wasting people's time. If code is faulty, it should be pointed out and adjusted by the answerer. I'm curious too as to what causes the result to be different on our machines, but I'm stumped. I have not managed to reproduce the error without changing the input, having tested it on 3 more machines. I'll leave the answer here for now in case it helps some searcher in the future. As a side note, consider accepting (clicking the checkmark) Rui Barradas' answer. https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – Donald Seinen Oct 28 '21 at 08:46