5

I have a data frame like below:

Group1  Group2  Group3  Group4
A       B       A       B   
A       C       B       A   
B       B       B       B   
A       C       B       D   
A       D       C       A   

I want to add a new column to the data frame which will have the count of unique elements in each row. Desired output:

Group1  Group2  Group3  Group4  Count
A       B       A       B       2
A       C       B       A       3
B       B       B       B       1
A       C       B       D       4
A       D       C       A       3

I am able to find such a count for each row using

length(unique(c(df[,c(1,2,3,4)][1,])))

I want to do the same thing for all rows in the data frame. I tried apply() with var=1 but without success. Also, it would be great if you could provide a more elegant solution to this.

smaug
  • How many unique values does your "data.frame" have? How many rows? You could convert your dataset in a `table(row(df), as.matrix(df))` format that could be more convenient to operate on for such tasks. Also, probably, consider a sparse alternative of it. – alexis_laz Apr 24 '17 at 08:40
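
The tabulation suggested in the comment above can be sketched in base R as follows. This assumes the sample data from the question, rebuilt here as a data frame named `df` with character columns; the names `m`, `tab` and `count` are introduced only for illustration.

```r
# Sample data from the question, rebuilt so the snippet is self-contained
df <- data.frame(
  Group1 = c("A", "A", "B", "A", "A"),
  Group2 = c("B", "C", "B", "C", "D"),
  Group3 = c("A", "B", "B", "B", "C"),
  Group4 = c("B", "A", "B", "D", "A"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)

# One table row per data-frame row, one table column per distinct value;
# each cell holds how often that value occurs in that row
tab <- table(row(m), m)

# A value is present in a row when its cell count is positive
count <- unname(rowSums(tab > 0))
count
#[1] 2 3 1 4 3
```

This avoids an explicit loop over rows, at the cost of building a table with one column per distinct value, which is why the comment also mentions a sparse alternative for large data.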

3 Answers

9

We can use apply with MARGIN = 1 to loop over the rows:

df1$Count <- apply(df1, 1, function(x) length(unique(x)))
df1$Count
#[1] 2 3 1 4 3
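
A self-contained version of the `apply` approach, as a sketch. `apply` converts the data frame to a matrix before iterating, so the snippet selects the group columns explicitly; this keeps an already added `Count` column out of the calculation if the line is run a second time. The `group_cols` name is an assumption introduced here for illustration.

```r
# Sample data from the question
df1 <- data.frame(
  Group1 = c("A", "A", "B", "A", "A"),
  Group2 = c("B", "C", "B", "C", "D"),
  Group3 = c("A", "B", "B", "B", "C"),
  Group4 = c("B", "A", "B", "D", "A"),
  stringsAsFactors = FALSE
)

# Columns to compare (hypothetical helper name)
group_cols <- c("Group1", "Group2", "Group3", "Group4")

# MARGIN = 1 applies the function to each row
df1$Count <- apply(df1[group_cols], 1, function(x) length(unique(x)))

# Running it again gives the same result, because Count is not among group_cols
df1$Count <- apply(df1[group_cols], 1, function(x) length(unique(x)))
df1$Count
#[1] 2 3 1 4 3
```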

Or using tidyverse

library(dplyr)
df1 %>%
    rowwise() %>%
    do(data.frame(., Count = n_distinct(unlist(.))))
# A tibble: 5 × 5
#   Group1 Group2 Group3 Group4 Count
#*  <chr>  <chr>  <chr>  <chr> <int>
#1      A      B      A      B     2
#2      A      C      B      A     3
#3      B      B      B      B     1
#4      A      C      B      D     4
#5      A      D      C      A     3

We can also use a regex, which is faster. It assumes that each cell holds exactly one character:

nchar(gsub("(.)(?=.*?\\1)", "", do.call(paste0, df1), perl = TRUE))
#[1] 2 3 1 4 3
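
To make the single-character assumption concrete, here is a small sketch on hypothetical data (`df2` is not from the question). The regex deletes every character that occurs again later in the pasted row string, so it counts distinct characters rather than distinct cell values.

```r
# Hypothetical data with cell values longer than one character
df2 <- data.frame(
  Group1 = c("AB", "AA"),
  Group2 = c("CD", "A"),
  Group3 = c("AB", "A"),
  stringsAsFactors = FALSE
)

# Regex approach: number of distinct characters in each pasted row
regex_count <- nchar(gsub("(.)(?=.*?\\1)", "", do.call(paste0, df2), perl = TRUE))
regex_count
#[1] 4 1

# apply approach: number of distinct cell values in each row
apply_count <- unname(apply(df2, 1, function(x) length(unique(x))))
apply_count
#[1] 2 2
```

The two results agree only when every cell is a single character, as in the question's example.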

More detailed explanation is given here

akrun
  • Thanks @akrun, your answer shows how to correctly use apply() to solve the problem. However, can you suggest a more elegant method to do the same rather than finding the count of unique elements for each row, if at all an alternative exists? – smaug Apr 24 '17 at 06:27
  • @satnam I updated with a tidyverse approach which would be more elegant – akrun Apr 24 '17 at 06:29
  • @satnam Added an efficient approach using regex. Perhaps it is more elegant. – akrun Apr 24 '17 at 07:01
  • The regex one is very clever, though it won't work if the strings are longer than one letter, no? – David Arenburg Apr 24 '17 at 07:22
  • @DavidArenburg That is true, but here I assumed it is a single letter based on the example – akrun Apr 24 '17 at 07:23
  • @akrun, from my point of view that is misleading. AT LEAST you should clearly state this in the answer but you tend not to do that and answer based on implicit assumptions about real data sets. Plus, the standard apply answer is present in the question you linked to yourself. – talat Apr 24 '17 at 07:34
3

Using duplicated in base R:

df$Count <- apply(df, 1, function(x) sum(!duplicated(x)))

#  Group1 Group2 Group3 Group4 Count
#1      A      B      A      B     2
#2      A      C      B      A     3
#3      B      B      B      B     1
#4      A      C      B      D     4
#5      A      D      C      A     3
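
A short sketch of why this works, on single rows written out as vectors. `duplicated` marks every repeat of an earlier value, so the number of `FALSE` entries equals the number of distinct values. The vector `y` with missing values is a hypothetical case not covered in the question; it shows that `NA` is counted as a value unless it is dropped first.

```r
# First row of the question's data
x <- c("A", "B", "A", "B")
duplicated(x)
#[1] FALSE FALSE  TRUE  TRUE

# Both expressions count the distinct values
n_first  <- sum(!duplicated(x))
n_unique <- length(unique(x))

# Hypothetical row with missing values: NA counts as a value of its own
y <- c("A", NA, "A", NA)
n_with_na    <- sum(!duplicated(y))
n_without_na <- sum(!duplicated(y[!is.na(y)]))
c(n_with_na, n_without_na)
#[1] 2 1
```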
989
2

Although there are some great solutions already mentioned here, you can also use data.table:

DATA:

df <- data.frame(g1 = c("A", "A", "B", "A", "A"),
                 g2 = c("B", "C", "B", "C", "D"),
                 g3 = c("A", "B", "B", "B", "C"),
                 g4 = c("B", "A", "B", "D", "A"),
                 stringsAsFactors = FALSE)

Code:

EDIT: Following David Arenburg's comment, I now use .I instead of 1:nrow(df). Thanks for the valuable comments.

library(data.table)
setDT(df)[, id := .I ]
df[, count := uniqueN(c(g1, g2, g3, g4)), by=id ]
df

Output:

> df
   g1 g2 g3 g4 id count
1:  A  B  A  B  1     2
2:  A  C  B  A  2     3
3:  B  B  B  B  3     1
4:  A  C  B  D  4     4
5:  A  D  C  A  5     3
PKumar
  • I am not very conversant with data tables in R; I will comment with constructive feedback once I try this out in detail, thanks! – smaug Apr 24 '17 at 06:44
  • No worries, read it. http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly – PKumar Apr 24 '17 at 06:52
  • This is basically the same as doing a for loop as you are running `1:nrow(df)` (btw, data.table has an `.I` operator instead) so this solution doesn't utilize data.table's advantages. – David Arenburg Apr 24 '17 at 07:12
  • Would it be possible to replace the `c(g1, g2, g3,g4)` with a character string or NSE? I tried `eval(names(df))` to no avail. – A Duv Jul 30 '18 at 23:15