
I want to do a row-wise check of whether multiple columns are all equal or not. I came up with a convoluted approach that counts the occurrences of each value per group, but this seems somewhat... cumbersome.

Sample data

sample_df <- data.frame(id = letters[1:6], group = rep(c('r','l'), 3), stringsAsFactors = FALSE)
set.seed(4)
for (i in 3:5) {
  sample_df[i] <- sample(1:4, 6, replace = TRUE)  # adds columns V3, V4, V5
}

Desired output

library(tidyverse)
sample_df %>% 
  gather(var, value, V3:V5) %>% 
  mutate(n_var = n_distinct(var)) %>% # number of gathered columns
  group_by(id, group, value) %>% 
  mutate(test = n_distinct(var) == n_var) %>% # TRUE if the value appears in every gathered column
  spread(var, value) %>%
  select(-n_var)

#> # A tibble: 6 x 6
#> # Groups:   id, group [6]
#>   id    group test     V3    V4    V5
#>   <chr> <chr> <lgl> <int> <int> <int>
#> 1 a     r     FALSE     3     3     1
#> 2 b     l     FALSE     1     4     4
#> 3 c     r     FALSE     2     4     2
#> 4 d     l     FALSE     2     1     2
#> 5 e     r     TRUE      4     4     4
#> 6 f     l     FALSE     2     2     3

Created on 2019-02-27 by the reprex package (v0.2.1)

Does not need to be dplyr. I just used it to show what I want to achieve.

tjebo
  • `rowSums(sample_df[ , 3:5] == sample_df[ , 3]) == 3` – Henrik Feb 27 '19 at 14:24
  • @Henrik thanks. However, I need a solution which can be used for many, many columns, and programmatically (therefore my step to "count the columns" first) – tjebo Feb 27 '19 at 14:25
  • `cols_to_test = 3:5; rowSums(sample_df[, cols_to_test] == sample_df[, cols_to_test[1]]) == length(cols_to_test)`. Set `cols_to_test` to be the indices or names of whatever columns you want to test. Perfectly generalizable. – Gregor Thomas Feb 27 '19 at 14:27
  • You can check row variance by using this function https://stackoverflow.com/a/25100036/1286528 `!as.logical(RowVar(sample_df[, 3:ncol(sample_df)]))` – pogibas Feb 27 '19 at 14:27
  • OK you're all right of course. Thanks to everyone. As I said, this was a bit of a brain freeze moment here. You're stars. Thanks – tjebo Feb 27 '19 at 14:28
  • You could also do `apply(sample_df[, cols_to_test], 1, function(x) length(unique(x)) == 1)`. – Gregor Thomas Feb 27 '19 at 14:29
  • @Gregor or Henrik, I'd appreciate it if you would kindly put your suggestion as an answer. I think it's worth it, even if it's only a one-liner. Would love to give you rep :) – tjebo Feb 27 '19 at 14:30
  • Related, dupe-oid [Count the number of rows where all columns have identical values](https://stackoverflow.com/questions/45948770/count-the-number-of-rows-where-all-columns-have-identical-values). Just provide a clear logic to identify relevant columns to check (name, position, class, etc.) - you hard-coded them in your own attempt (`V3:V5`), so I did as well. – Henrik Feb 27 '19 at 14:31
  • A point against the row variance approach is that it will only work for numeric data. `==` and `length(unique())` will work regardless of data type. – Gregor Thomas Feb 27 '19 at 14:33

1 Answer


There are a bunch of ways to check for equality row-wise. Two good ways:

# test that all values equal the first column
rowSums(df == df[, 1]) == ncol(df)

# count the unique values, see if there is just 1
apply(df, 1, function(x) length(unique(x)) == 1)
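As a quick illustration, here is a sketch on a toy all-numeric data frame (not the question's data; the name `df` is just a placeholder):

# toy data: rows 1 and 2 have all-equal values, row 3 does not
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 4), c = c(1, 2, 3))

rowSums(df == df[, 1]) == ncol(df)
#> [1]  TRUE  TRUE FALSE

apply(df, 1, function(x) length(unique(x)) == 1)
#> [1]  TRUE  TRUE FALSE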

If you only want to test some columns, then use a subset of columns rather than the whole data frame:

cols_to_test = c(3, 4, 5)
rowSums(df[cols_to_test] == df[, cols_to_test[1]]) == length(cols_to_test)

# count the unique values, see if there is just 1
apply(df[cols_to_test], 1, function(x) length(unique(x)) == 1)
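Applied to the question's sample_df this might look like the sketch below (the exact logical values depend on the random draw above); the result can be bound back on as a column:

cols_to_test <- 3:5
sample_df$test <- rowSums(sample_df[cols_to_test] == sample_df[, cols_to_test[1]]) ==
  length(cols_to_test)
sample_df  # one row per id, with test = TRUE where V3, V4 and V5 all match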

Note I use `df[cols_to_test]` instead of `df[, cols_to_test]` when I want to be sure the result is a data.frame even if `cols_to_test` has length 1.
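A small illustration of that difference, using the sample data with a hypothetical length-1 `cols_to_test`:

cols_to_test <- 3
class(sample_df[cols_to_test])    # "data.frame" -- still works with rowSums()
class(sample_df[, cols_to_test])  # "integer"    -- the single column is dropped to a vector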

Gregor Thomas