4

I'd like to know if there is any efficient way of checking if a given pair (or tuple of more than two) of columns is in a data frame.

For example, suppose I had the following data frame:

df=data.frame(c("a","b","c","d"),c("e","f","g","h"),c(1,0,0,1))
names(df)=c('col1','col2','col3')

  col1 col2 col3
1    a    e    1
2    b    f    0
3    c    g    0
4    d    h    1

and I want to check if this table contains a list of pairs of columns, say: (a,b), (a,c), (a,e), (c,a), (c,g), (a,f)

to which it should output:

FALSE FALSE TRUE FALSE TRUE FALSE

Edit: added a new pair (a,f) to avoid confusion

I thought of doing this by concatenating the columns into strings and then comparing with %in%, but this is quite inefficient. I also thought of doing a loop with dplyr's filter, but it also takes quite a time when the table is huge and requires converting formats (i.e. writing several lines).

Is there any efficient way of accomplishing this in R?

anymous_asker
  • 113
  • 1
  • 11

2 Answers2

2

This seems like a case for one of the apply or lapply family of functions. If you define pairs.list as a list, you can use lapply:

df = data.frame(c("a","b","c","d"), c("e","f","g","h"), c(1,0,0,1))
names(df) = c('col1','col2','col3')
pairs.list = list(c("a", "b"), c("a", "c"), c("a", "e"), c("c", "a"), c("c", "g"))
lapply(pairs.list, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})

[[1]]
[1] FALSE

[[2]]
[1] FALSE

[[3]]
[1] TRUE

[[4]]
[1] FALSE

[[5]]
[1] TRUE

new.pairs = list(c("a", "b"), c("a", "c"), c("e", "a"), c("c", "a"), c("c", "g"))

lapply(new.pairs, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})

[[1]]
[1] FALSE

[[2]]
[1] FALSE

[[3]]
[1] FALSE

[[4]]
[1] FALSE

[[5]]
[1] TRUE

With this method, if you want to know the row of the df that matches, you can get rid of the any() call and receive a list of a vector of booleans, where each vector is the same length as df.

I think this should be relatively efficient because it is all boolean logic rather than string manipulation, but I'm not an expert at benchmarking performance in R so I don't know for sure.

Curt F.
  • 4,690
  • 2
  • 22
  • 39
1

If you only need to check that the column combinations are in the table or not, you could use unique to cut down the number of comparisons:

df=data.frame(c("a","b","c","d"),c("e","f","g","h"),c(1,0,0,1), stringsAsFactors=FALSE)
names(df)=c('col1','col2','col3')

df$to_check = paste(df$col1, df$col2, sep=',')
cols <- c("a,b", "a,c", "a,e", "c,a", "c,g")

cols %in% unique(df$to_check)
ajerneck
  • 751
  • 1
  • 7
  • 19