
A simplified version of my data set would look like:

depth value
   1     a
   1     b
   2     a
   2     b
   2     b
   3     c

I would like to make a new data set where, for each value of "depth", I would have the cumulative number of unique values, starting from the top, e.g.:

depth cumsum
 1      2
 2      2
 3      3

Any ideas as to how to do this? I am relatively new to R.
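(For anyone who wants to reproduce the example, the sample data above can be built as follows; this is a sketch assuming plain character values and the column names shown in the table.)

```r
# Simplified data set from the question
df <- data.frame(
  depth = c(1, 1, 2, 2, 2, 3),
  value = c("a", "b", "a", "b", "b", "c"),
  stringsAsFactors = FALSE  # keep value as character
)
```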

David Robinson
user2223405

6 Answers


I find this a perfect case for using factor and setting its levels carefully. I'll use data.table here with this idea. Make sure your value column is character (not an absolute requirement).

  • step 1: Convert your data.frame to a data.table, keeping only the unique rows.

    require(data.table)
    dt <- as.data.table(unique(df))
    setkey(dt, "depth") # just to be sure before factoring "value"
    
  • step 2: Convert value to a factor and coerce it to numeric. Make sure to set the levels yourself, in order of first appearance (this is important).

    dt[, id := as.numeric(factor(value, levels = unique(value)))]
    
  • step 3: Set the key columns to depth and id for subsetting, and pick just the last value for each depth.

     setkey(dt, "depth", "id")
     dt.out <- dt[J(unique(depth)), mult="last"][, value := NULL]
    
    #    depth id
    # 1:     1  2
    # 2:     2  2
    # 3:     3  3
    
  • step 4: Since the count for each row must be at least that of the previous (shallower) row, use cummax to get the final output.

    dt.out[, id := cummax(id)]
    

Edit: The above code was for illustrative purposes. In reality you don't need a third column at all. This is how I'd write the final code:

require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth")
dt[, value := as.numeric(factor(value, levels = unique(value)))]
setkey(dt, "depth", "value")
dt.out <- dt[J(unique(depth)), mult="last"]
dt.out[, value := cummax(value)]

Here's a trickier example along with the output from the code:

df <- structure(list(depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6), 
                value = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 6L, 1L, 1L), 
                .Label = c("a", "b", "c", "d", "f", "g"), class = "factor")), 
                .Names = c("depth", "value"), row.names = c(NA, -11L), 
                class = "data.frame")
#    depth value
# 1:     1     2
# 2:     2     4
# 3:     3     4
# 4:     4     5
# 5:     5     6
# 6:     6     6
Arun
  • Here's a `dplyr` version: `df %>% arrange(depth) %>% mutate(value = cummax(as.numeric(factor(value, levels = unique(value))))) %>% arrange(depth, desc(value)) %>% distinct(depth)`. – Jake Fisher Mar 24 '15 at 01:46
  • This method can be generally applied for when both `depth` and `value` are string values. Thanks! – ecoe Dec 15 '15 at 12:47
  • @Arun This is a great solution! Thanks! – asterx Oct 21 '16 at 04:21
  • What if we have a third variable called group and we want to achieve the same results for each group? – MLE Sep 01 '18 at 17:53

A dplyr attempt:

df %>%
  #group_by(group)%>% # if you have a third variable and you want to achieve the same results for each group
  mutate(cum_unique_entries = cumsum(!duplicated(value))) %>%
  group_by(depth) %>% # add group variable for more layers
  summarise(cum_unique_entries = last(cum_unique_entries))
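Run on the sample data from the question, this pipeline gives the desired result (a sketch; it assumes the rows are already sorted by depth, since the cumulative count is taken in row order):

```r
library(dplyr)

# Sample data from the question, already sorted by depth
df <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                 value = c("a", "b", "a", "b", "b", "c"))

df %>%
  mutate(cum_unique_entries = cumsum(!duplicated(value))) %>%
  group_by(depth) %>%
  summarise(cum_unique_entries = last(cum_unique_entries))
# depth 1, 2, 3 with cum_unique_entries 2, 2, 3
```

If the rows might not be sorted, add `arrange(depth)` before the `mutate()`.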
MLE

Here is another attempt:

numvals <- cummax(as.numeric(factor(mydf$value)))
aggregate(numvals, list(depth=mydf$depth), max)

Which gives:

  depth x
1     1 2
2     2 2
3     3 3

It seems to work with @Arun's example, too:

  depth x
1     1 2
2     2 4
3     3 4
4     4 5
5     5 6
6     6 6
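One caveat: `factor()` with default levels sorts them alphabetically, so this relies on values first appearing in alphabetical order. Passing `levels = unique(...)` (as in the accepted answer) makes the trick order-robust. A sketch with values deliberately out of alphabetical order:

```r
# "c" first appears after "g" and "h", so alphabetical levels would misorder it
vals <- c("a", "b", "g", "h", "b", "c")

# Levels in order of first appearance: a=1, b=2, g=3, h=4, c=5
numvals <- cummax(as.numeric(factor(vals, levels = unique(vals))))
numvals
# 1 2 3 4 4 5
```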
juba
  • I'm not entirely sure, but it appears that both `depth` and `value` must be simultaneously sorted. For instance, this method won't count the unique occurrence of `c` no matter how you `setkey()` this `data.table`: `mydf = data.table(data.frame(depth=c(1,1,2,2,6,7), value=c("a", "b", "g", "h", "b", "c")))`. – ecoe Dec 15 '15 at 12:45

A good first step would be to create a column of TRUE or FALSE, where it is TRUE for the first of each value and FALSE for later appearances of that value. This can be done easily using duplicated:

mydata$first.appearance = !duplicated(mydata$value)

Summing within each depth is then best done using aggregate. In this case, the formula says to sum the first.appearance column within each subset of depth:

newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)

The result will look like:

  depth first.appearance
1     1                2
2     2                0
3     3                1

This is still not a cumulative sum, though. For that you can use the cumsum function (and then get rid of your old column):

newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

So to recap:

mydata$first.appearance = !duplicated(mydata$value)
newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)
newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

Output:

  depth cumsum
1     1      2
2     2      2
3     3      3
David Robinson
  • nice! you can combine this w/ data.table `DT[, .(depth, unique.count = cumsum(!duplicated(value)))][, .(cumsum = max(unique.count)), by = .(depth)]` – Ethan May 28 '22 at 12:59

This can be written in a relatively clean fashion with a single SQL statement using the sqldf package. Assume DF is the original data frame:

library(sqldf)

sqldf("select b.depth, count(distinct a.value) as cumsum
       from DF a join DF b
       on a.depth <= b.depth
       group by b.depth")
G. Grothendieck
  • This is very useful assuming `depth` is numeric. If `depth` is a string or string representation of a date, as it was in my case, it can be a very expensive operation. – ecoe Dec 15 '15 at 12:43
  • In many cases the speed is unimportant and clarity is the more important issue. If performance is important then you really have to test it rather than make assumptions and if found too slow add an index and test it again. – G. Grothendieck Dec 15 '15 at 13:01

Here is another solution using lapply(). unique(df$depth) gives the vector of unique depth values; for each such value x, subset the value entries where depth is less than or equal to x, and count the distinct ones with length(unique(...)). That count becomes the cumsum column, paired with depth = x for that level. Finally, do.call(rbind, ...) combines the per-depth data frames into one.

do.call(rbind, lapply(unique(df$depth), function(x)
  data.frame(depth = x, cumsum = length(unique(df$value[df$depth <= x])))))
  depth cumsum
1     1      2
2     2      2
3     3      3
Didzis Elferts