
A simplified version of my data set would look like:

depth value
   1     a
   1     b
   2     a
   2     b
   2     b
   3     c

I would like to make a new data set where, for each value of "depth", I would have the cumulative number of unique values, starting from the top, e.g.:

depth cumsum
 1      2
 2      2
 3      3

Any ideas as to how to do this? I am relatively new to R.
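(For anyone who wants to reproduce the example, the sample data above can be built as follows; this is a sketch assuming plain character values and the column names shown in the table.)

```r
# Simplified data set from the question
df <- data.frame(
  depth = c(1, 1, 2, 2, 2, 3),
  value = c("a", "b", "a", "b", "b", "c"),
  stringsAsFactors = FALSE  # keep value as character
)
```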

David Robinson
user2223405

6 Answers


I find this a perfect case for using factor and setting its levels carefully. I'll use data.table here with this idea. Make sure your value column is character (not an absolute requirement).

  • step 1: Convert your data.frame to a data.table, keeping only the unique rows.

    require(data.table)
    dt <- as.data.table(unique(df))
    setkey(dt, "depth") # just to be sure before factoring "value"
    
  • step 2: Convert value to a factor and coerce it to numeric. Make sure to set the levels yourself, in order of first appearance (this is important).

    dt[, id := as.numeric(factor(value, levels = unique(value)))]
    
  • step 3: Set the key columns to depth and id for subsetting, and pick just the last value for each depth.

     setkey(dt, "depth", "id")
     dt.out <- dt[J(unique(depth)), mult="last"][, value := NULL]
    
    #    depth id
    # 1:     1  2
    # 2:     2  2
    # 3:     3  3
    
  • step 4: Since the count for each row must be at least that of the previous (shallower) row, use cummax to get the final output.

    dt.out[, id := cummax(id)]
    

Edit: The above code was for illustrative purposes. In reality you don't need a third column at all. This is how I'd write the final code:

require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth")
dt[, value := as.numeric(factor(value, levels = unique(value)))]
setkey(dt, "depth", "value")
dt.out <- dt[J(unique(depth)), mult="last"]
dt.out[, value := cummax(value)]

Here's a trickier example along with the output from the code:

df <- structure(list(depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6), 
                value = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 6L, 1L, 1L), 
                .Label = c("a", "b", "c", "d", "f", "g"), class = "factor")), 
                .Names = c("depth", "value"), row.names = c(NA, -11L), 
                class = "data.frame")
#    depth value
# 1:     1     2
# 2:     2     4
# 3:     3     4
# 4:     4     5
# 5:     5     6
# 6:     6     6
Arun
  • Here's a `dplyr` version: `df %>% arrange(depth) %>% mutate(value = cummax(as.numeric(factor(value, levels = unique(value))))) %>% arrange(depth, desc(value)) %>% distinct(depth)`. – Jake Fisher Mar 24 '15 at 01:46
  • This method can be generally applied for when both `depth` and `value` are string values. Thanks! – ecoe Dec 15 '15 at 12:47
  • @Arun This is a great solution! Thanks! – asterx Oct 21 '16 at 04:21
  • What if we have a third variable called group and we want to achieve the same results for each group? – MLE Sep 01 '18 at 17:53

A dplyr attempt:

df %>%
  #group_by(group)%>% # if you have a third variable and you want to achieve the same results for each group
  mutate(cum_unique_entries = cumsum(!duplicated(value))) %>%
  group_by(depth) %>% # add group variable for more layers
  summarise(cum_unique_entries = last(cum_unique_entries))
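Run on the sample data from the question, this pipeline gives the desired result (a sketch; it assumes the rows are already sorted by depth, since the cumulative count is taken in row order):

```r
library(dplyr)

# Sample data from the question, already sorted by depth
df <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                 value = c("a", "b", "a", "b", "b", "c"))

df %>%
  mutate(cum_unique_entries = cumsum(!duplicated(value))) %>%
  group_by(depth) %>%
  summarise(cum_unique_entries = last(cum_unique_entries))
# depth 1, 2, 3 with cum_unique_entries 2, 2, 3
```

If the rows might not be sorted, add `arrange(depth)` before the `mutate()`.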
MLE

Here is another attempt:

numvals <- cummax(as.numeric(factor(mydf$value)))
aggregate(numvals, list(depth=mydf$depth), max)

Which gives:

  depth x
1     1 2
2     2 2
3     3 3

It seems to work with @Arun's example, too:

  depth x
1     1 2
2     2 4
3     3 4
4     4 5
5     5 6
6     6 6
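One caveat: `factor()` with default levels sorts them alphabetically, so this relies on values first appearing in alphabetical order. Passing `levels = unique(...)` (as in the accepted answer) makes the trick order-robust. A sketch with values deliberately out of alphabetical order:

```r
# "c" first appears after "g" and "h", so alphabetical levels would misorder it
vals <- c("a", "b", "g", "h", "b", "c")

# Levels in order of first appearance: a=1, b=2, g=3, h=4, c=5
numvals <- cummax(as.numeric(factor(vals, levels = unique(vals))))
numvals
# 1 2 3 4 4 5
```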
juba
  • I'm not entirely sure, but it appears that both `depth` and `value` must be simultaneously sorted. For instance, this method won't count the unique occurrence of `c` no matter how you `setkey()` this `data.table`: `mydf = data.table(data.frame(depth=c(1,1,2,2,6,7), value=c("a", "b", "g", "h", "b", "c")))`. – ecoe Dec 15 '15 at 12:45

A good first step would be to create a column of TRUE or FALSE, where it is TRUE for the first of each value and FALSE for later appearances of that value. This can be done easily using duplicated:

mydata$first.appearance = !duplicated(mydata$value)

Summing within each depth is then best done using aggregate. In this case, the formula says to sum the first.appearance column within each subset of depth:

newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)

The result will look like:

  depth first.appearance
1     1                2
2     2                0
3     3                1

This is still not a cumulative sum, though. For that you can use the cumsum function (and then get rid of your old column):

newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

So to recap:

mydata$first.appearance = !duplicated(mydata$value)
newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)
newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

Output:

  depth cumsum
1     1      2
2     2      2
3     3      3
David Robinson
  • nice! you can combine this w/ data.table `DT[, .(depth, unique.count = cumsum(!duplicated(value)))][, .(cumsum = max(unique.count)), by = .(depth)]` – Ethan May 28 '22 at 12:59

This can be written in a relatively clean fashion with a single SQL statement using the sqldf package. Assume DF is the original data frame:

library(sqldf)

sqldf("select b.depth, count(distinct a.value) as cumsum
       from DF a join DF b
       on a.depth <= b.depth
       group by b.depth")
G. Grothendieck
  • This is very useful assuming `depth` is numeric. If `depth` is a string or string representation of a date, as it was in my case, it can be a very expensive operation. – ecoe Dec 15 '15 at 12:43
  • In many cases the speed is unimportant and clarity is the more important issue. If performance is important then you really have to test it rather than make assumptions and if found too slow add an index and test it again. – G. Grothendieck Dec 15 '15 at 13:01

Here is another solution using lapply(). unique(df$depth) gives the vector of unique depth values; for each such value x, subset the value entries where depth is less than or equal to x, and count the distinct ones with length(unique(...)). That count becomes the cumsum column, paired with depth = x for that level. Finally, do.call(rbind, ...) combines the per-depth data frames into one.

do.call(rbind, lapply(unique(df$depth), function(x)
  data.frame(depth = x, cumsum = length(unique(df$value[df$depth <= x])))))
  depth cumsum
1     1      2
2     2      2
3     3      3
Didzis Elferts