
I have a dataset where a bunch of character columns only ever have one value: the name of the column itself. Each row is an observation, and I want to count how many of those columns are filled in for each row.

For example:

id multi_value_col single_value_col_1 single_value_col_2
1  A               single_value_col_1 
2  D2              single_value_col_1 single_value_col_2
3  Z6                                 single_value_col_2

What I'd like is to add a column that counts how many of those single-value columns are filled per row. Like so:

id multi_value_col single_value_col_1 single_value_col_2  count
1  A               single_value_col_1                     1 
2  D2              single_value_col_1 single_value_col_2  2
3  Z6                                 single_value_col_2  1

My initial idea was to use mutate_if and n_distinct, replacing the string with TRUE, which could then be used in a mutate with rowSums:

data %>% 
 mutate_if(~n_distinct(.) == 1, TRUE, .) %>%
 mutate(count = rowSums???)

However, I can't get the mutate_if call working, and I'm not sure about the rowSums part either: is there a sum-if-TRUE operation available across rows?
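Something like this sketch is what I'm aiming for (untested, and treating "" as the blank value is an assumption on my part):

data %>% 
    mutate_if(~ n_distinct(.[. != ""]) == 1, ~ . != "") %>%   # filled cells become TRUE, blanks FALSE
    mutate(count = rowSums(select_if(., is.logical)))         # row-wise sum of the TRUEs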

Khashir
  • Can you post a sample of your data in code? e.g. `dput()`? Maybe it is something like this too? https://stackoverflow.com/questions/18862114/count-number-of-columns-by-a-condition-for-each-row – william3031 Jun 26 '19 at 00:14

2 Answers


If we are checking against the column names, then

library(tidyverse)
data %>%
    mutate(count = pmap_int(.[-1], ~ {
        x1 <- c(...)
        sum(x1 == names(x1))
    }))
#  id multi_value_col single_value_col_1 single_value_col_2 count
#1  1               A single_value_col_1                        1
#2  2              D2 single_value_col_1 single_value_col_2     2
#3  3              Z6                    single_value_col_2     1

Or in base R

rowSums(data[-1] == names(data)[-1][col(data[-1])])

If the dataset has NA's, just change it to

rowSums(data[-1] == names(data)[-1][col(data[-1])], na.rm = TRUE)
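
To see what the comparison inside rowSums is doing, the indexing recycles the column names into the same shape as the data. A sketch of the intermediate pieces, using the example data at the bottom of this post:

col(data[-1])                               # column position of every cell in data[-1]
names(data)[-1][col(data[-1])]              # the column names recycled into that shape
data[-1] == names(data)[-1][col(data[-1])]  # TRUE where a cell equals its own column name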

Other options suggested by @thelatemail include transposing the selected columns and doing a colSums on the logical matrix (nm1, the index of the single_value columns, is defined further down)

nms <- names(data)[nm1]
colSums(t(data[nms]) == nms)

Or with Reduce

Reduce(`+`, Map(`==`, data[nms], nms))

If the intention is to count based on the non-blank cells, then in base R we can create a logical matrix from the columns of interest and do rowSums on it

nm1 <- grep("single_value", names(data))
data$count <-  rowSums(data[nm1] != "")

With dplyr

library(dplyr)
data %>% 
    mutate(count = rowSums(.[nm1] != ""))
#  id multi_value_col single_value_col_1 single_value_col_2 count
#1  1               A single_value_col_1                        1
#2  2              D2 single_value_col_1 single_value_col_2     2
#3  3              Z6                    single_value_col_2     1

data

data <- structure(list(id = 1:3, multi_value_col = c("A", "D2", "Z6"), 
    single_value_col_1 = c("single_value_col_1", "single_value_col_1", 
    ""), single_value_col_2 = c("", "single_value_col_2", "single_value_col_2"
    )), row.names = c(NA, -3L), class = "data.frame")
akrun
  • You could also spin the dataset with `t` and compare - `colSums(t(data[nms]) == nms)` - where `nms` is the names of the columns. Or a functional approach like the `pmap` in base R - `Reduce(`+`, Map(`==`, data[nms], nms))` – thelatemail Jun 26 '19 at 00:40

Depending on whether you have NA values or blank cells in your data, you could use one of the following base R approaches, where we first find the columns with only one unique value and then count the non-NA or non-blank cells per row in those columns.

If you have NA's

cols <- which(sapply(df, function(x) length(unique(na.omit(x)))) == 1)
df$count <- rowSums(!is.na(df[cols]))

df
#  id multi_value_col single_value_col_1 single_value_col_2 count
#1  1               A single_value_col_1               <NA>     1
#2  2              D2 single_value_col_1 single_value_col_2     2
#3  3              Z6               <NA> single_value_col_2     1
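
df here is assumed to be the question's data with NA in place of the blanks, i.e. something like

df <- structure(list(id = 1:3, multi_value_col = c("A", "D2", "Z6"), 
    single_value_col_1 = c("single_value_col_1", "single_value_col_1", NA), 
    single_value_col_2 = c(NA, "single_value_col_2", "single_value_col_2")), 
    row.names = c(NA, -3L), class = "data.frame")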

If you have empty cells

cols <- which(sapply(df, function(x) length(unique(x[x!=""]))) == 1)
df$count <- rowSums(df[cols] != "")
Ronak Shah
  • I think the empty-cell approach will work, but I get an error: `Error in x[x != ""] : object of type 'symbol' is not subsettable.` – Khashir Jun 26 '19 at 19:44
  • @Khashir do you have some `T`/`F` or `TRUE/FALSE` values in the data? Can you convert all the columns to characters before applying this? `df[] <- lapply(df, as.character)` – Ronak Shah Jun 27 '19 at 01:33
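
Putting that last suggestion together with the blank-cell approach above gives, as a rough sketch (the character conversion is only needed if the column types are the issue, as the comment suggests):

df[] <- lapply(df, as.character)    # convert every column to character first
cols <- which(sapply(df, function(x) length(unique(x[x != ""]))) == 1)
df$count <- rowSums(df[cols] != "")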