Counting unique values across variables (columns) in R

Question

I have a large dataset with repeated measures over 5 time periods.

   2012  2009  2006  2003  2000
    3     1     4     4     1
    5     3     2     2     3
    6     7     3     5     6

I want to add a new column holding the number of unique values in each row across the years 2000 to 2012, e.g.:

   2012  2009  2006  2003  2000  nunique
    3     1     4     4     1      3
    5     3     2     2     3      3
    6     7     3     5     6      4

I am working in R and, if it helps, the measurement can take only 14 distinct values at each time period.

I found this page: Count occurrences of value in a set of variables in R (per row) and tried the various solutions offered on it. However, they give me a count of each value, not the number of unique values. Other similar questions on here seem to ask about counting the number of unique values within a variable/column, rather than across each row. Any suggestions would be appreciated.

score 2 · Answer 1 · answered Sep 24 '14 at 20:58

Here's one alternative

> df$nunique <- apply(df, 1, function(x) length(unique(x)))
> df
  2012 2009 2006 2003 2000 nunique
1    3    1    4    4    1       3
2    5    3    2    2    3       3
3    6    7    3    5    6       4

Jilber Urbina


Please note: if your data frame has NAs in it, this will count NA as a unique value. Amend with: df$nunique <- apply(df, 1, function(x) length(unique(na.omit(x)))), which applies na.omit to x before counting. – Jordan Collins Mar 17 '16 at 15:41
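
To make the difference described in this comment concrete, here is a small sketch. The data are the question's example with one value replaced by NA (made up for illustration); the names with_na and without_na are just placeholders.

```r
# Example data: the question's values, with one NA introduced for illustration
df <- data.frame("2012" = c(3, 5, 6),
                 "2009" = c(NA, 3, 7),
                 "2006" = c(4, 2, 3),
                 "2003" = c(4, 2, 5),
                 "2000" = c(1, 3, 6),
                 check.names = FALSE)

# unique() keeps NA as its own value, so the first row counts 3, NA, 4, 1
with_na <- apply(df, 1, function(x) length(unique(x)))

# Dropping NAs first counts only the observed values
without_na <- apply(df, 1, function(x) length(unique(na.omit(x))))

print(with_na)     # counts 4, 3, 4
print(without_na)  # counts 3, 3, 4
```

Which of the two is right depends on whether a missing measurement should count as a distinct state in your analysis.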

Michael Lawrence · Answer 2 · 2014-09-24T23:17:24.767

1

If you have a large dataset, you may want to avoid looping over the rows and instead use a faster framework, like S4Vectors:

# check.names=FALSE keeps the column names as '2012', ... rather than 'X2012', ...
df <- data.frame('2012'=c(3,5,6),
                 '2009'=c(1,3,7),
                 '2006'=c(4,2,3),
                 '2003'=c(4,2,5),
                 '2000'=c(1,3,6),
                 check.names=FALSE)

dup <- S4Vectors:::duplicatedIntegerPairs(as.integer(as.matrix(df)), row(df))
dim(dup) <- dim(df)
rowSums(!dup)
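
The idea behind the snippet above is to flag every cell whose (row, value) pair has already been seen, then count the unflagged cells per row. Since duplicatedIntegerPairs is reached through `:::` and so is not part of the package's exported interface, here is a rough base-R sketch of the same idea. It swaps in duplicated() on a two-column data frame, which is likely slower and is only meant to show the logic; the names m, pairs and nunique are illustrative.

```r
df <- data.frame("2012" = c(3, 5, 6),
                 "2009" = c(1, 3, 7),
                 "2006" = c(4, 2, 3),
                 "2003" = c(4, 2, 5),
                 "2000" = c(1, 3, 6),
                 check.names = FALSE)
m <- as.matrix(df)

# One (row index, value) pair per cell, in column-major order
pairs <- data.frame(row = as.vector(row(m)), value = as.vector(m))

# A cell is a duplicate if the same value appeared earlier in the same row
dup <- duplicated(pairs)
dim(dup) <- dim(m)

# Non-duplicated cells per row = number of distinct values per row
nunique <- rowSums(!dup)
print(nunique)  # counts 3, 3, 4
```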

Or, the matrixStats package:

m <- as.matrix(df)
mode(m) <- "integer"
rowSums(matrixStats::rowTabulates(m) > 0)
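
If installing a package is not an option, a similar "tabulate, then count the non-empty cells" idea can be sketched in base R by looping over the candidate values rather than over the rows. This relies on the question's remark that there are only a few possible values (14 there), so the loop is short even for many rows. It is a sketch under that assumption, not a benchmarked replacement; vals and nunique are illustrative names.

```r
df <- data.frame("2012" = c(3, 5, 6),
                 "2009" = c(1, 3, 7),
                 "2006" = c(4, 2, 3),
                 "2003" = c(4, 2, 5),
                 "2000" = c(1, 3, 6),
                 check.names = FALSE)
m <- as.matrix(df)

# Candidate values; sort() drops NA, so missing values are never counted
vals <- sort(unique(as.vector(m)))

# For each candidate value, add 1 to every row in which it occurs at least once
nunique <- integer(nrow(m))
for (v in vals) {
  nunique <- nunique + (rowSums(m == v, na.rm = TRUE) > 0)
}
print(nunique)  # counts 3, 3, 4
```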

edited Sep 24 '14 at 23:17

answered Sep 24 '14 at 23:09

Michael Lawrence


Tried that with the matrixStats package - it is indeed much faster - thanks! – user3251223 Sep 26 '14 at 17:02
S4Vectors is about 4 times faster than it. – Michael Lawrence Sep 26 '14 at 18:50

score 0 · Accepted Answer · answered Sep 24 '14 at 21:00

The trick is to use 'apply' with a margin of 1, which passes each row to a function as its argument (e.g. x). You can then write a custom function, in this case one that uses 'unique' and 'length', to get the answer that you want.

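# check.names=FALSE keeps the column names as '2012', ... rather than 'X2012', ...
df <- data.frame('2012'=c(3,5,6), '2009'=c(1,3,7), '2006'=c(4,2,3), '2003'=c(4,2,5), '2000'=c(1,3,6), check.names=FALSE)

df$nunique <- apply(df, 1, function(x) length(unique(x)))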
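
In a real dataset there will usually be other columns besides the five measurements. Because apply() converts its input to a matrix first, it is safer to pass only the year columns. The sketch below assumes a hypothetical id column and a hypothetical year_cols vector; neither appears in the question.

```r
df <- data.frame(id = c("a", "b", "c"),   # hypothetical extra column
                 "2012" = c(3, 5, 6),
                 "2009" = c(1, 3, 7),
                 "2006" = c(4, 2, 3),
                 "2003" = c(4, 2, 5),
                 "2000" = c(1, 3, 6),
                 check.names = FALSE,
                 stringsAsFactors = FALSE)

# Name the measurement columns explicitly (assumed names, matching the question)
year_cols <- c("2012", "2009", "2006", "2003", "2000")

# Only the year columns are passed to apply(), so id is never counted
df$nunique <- apply(df[year_cols], 1, function(x) length(unique(x)))
print(df$nunique)  # counts 3, 3, 4
```

Selecting the columns by name also means that re-running the line does not accidentally include the nunique column itself in the count.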

score 0 · Answer 4 · answered May 15 '18 at 04:04

Try this one out:

sapply(data, function(x) length(unique(x)))
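
One thing to check before using this: sapply() on a data frame iterates over its columns, so the line above returns the number of unique values within each column, not within each row as the question asks. The following sketch puts the two side by side on the question's example data; per_column and per_row are illustrative names.

```r
df <- data.frame("2012" = c(3, 5, 6),
                 "2009" = c(1, 3, 7),
                 "2006" = c(4, 2, 3),
                 "2003" = c(4, 2, 5),
                 "2000" = c(1, 3, 6),
                 check.names = FALSE)

# sapply() walks the columns: one count per year
per_column <- sapply(df, function(x) length(unique(x)))

# apply() with margin 1 walks the rows: one count per observation
per_row <- apply(df, 1, function(x) length(unique(x)))

print(per_column)  # every year has 3 distinct values in this example
print(per_row)     # counts 3, 3, 4
```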

Mudit Gupta
