
I have the following data:

Date           Col1       Col2
2014-01-01     123        12
2014-01-01     123        21
2014-01-01     124        32
2014-01-01     125        32
2014-01-02     123        34
2014-01-02     126        24
2014-01-02     127        23
2014-01-03     521        21
2014-01-03     123        13
2014-01-03     126        15

Now, I want to count, for each date, the unique values in Col1 that have not appeared on any previous date, and add that count to the previous date's running total. For example:

Date           Count
2014-01-01       3     (new values: 123, 124, 125)
2014-01-02       5     (3 + 2 new: 126, 127)
2014-01-03       6     (5 + 1 new: 521 only)
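
For reproducibility, here is one way the sample data might be constructed (a sketch: Date is kept as character and the data frame is named df to match the answers below):

df <- data.frame(
  Date = c("2014-01-01", "2014-01-01", "2014-01-01", "2014-01-01",
           "2014-01-02", "2014-01-02", "2014-01-02",
           "2014-01-03", "2014-01-03", "2014-01-03"),
  Col1 = c(123, 123, 124, 125, 123, 126, 127, 521, 123, 126),
  Col2 = c(12, 21, 32, 32, 34, 24, 23, 21, 13, 15),
  stringsAsFactors = FALSE  # plain character/numeric columns assumed
)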
BigDataScientist

2 Answers

library(dplyr)
df %.% 
  arrange(Date) %.% 
  filter(!duplicated(Col1)) %.% 
  group_by(Date) %.% 
  summarise(Count=n()) %.% # n() <=> length(Date)
  mutate(Count = cumsum(Count))
# Source: local data frame [3 x 2]
# 
#         Date Count
# 1 2014-01-01     3
# 2 2014-01-02     5
# 3 2014-01-03     6
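
Side note: `%.%` was dplyr's original chaining operator and has since been deprecated in favour of `%>%`; the same pipeline in current dplyr versions, otherwise unchanged, would be:

library(dplyr)
df %>%
  arrange(Date) %>%
  filter(!duplicated(Col1)) %>%   # keep each Col1 value only on its first date
  group_by(Date) %>%
  summarise(Count = n()) %>%
  mutate(Count = cumsum(Count))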

library(data.table)
dt <- data.table(df, key="Date")
dt <- unique(dt, by="Col1")
(dt <- dt[, list(Count=.N), by=Date][, Count:=cumsum(Count)])
#          Date Count
# 1: 2014-01-01     3
# 2: 2014-01-02     5
# 3: 2014-01-03     6

Or

dt <- data.table(df, key="Date")
dt <- unique(dt, by="Col1")
dt[, .N, by=Date][, Count:=cumsum(N)]

.N is named N (no dot) automatically for convenience in chained operations like this, so you can use both .N and N together in the next operation if need be.
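
A quick illustration of that point, reusing the keyed table from above (the column names Dates and NewValues are made up for this example): N is the per-date count carried over from the first operation, while .N in the second operation is the number of rows (i.e. dates) of that intermediate result.

dt <- data.table(df, key="Date")
dt <- unique(dt, by="Col1")
dt[, .N, by=Date][, .(Dates=.N, NewValues=sum(N))]
#    Dates NewValues
# 1:     3         6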

Matt Dowle
lukeA
  • Great! Thank you. I like the data.table option better. – BigDataScientist Jan 28 '14 at 21:50
  • Thanks @Arun. I'm rather new to data.table. But I'm starting to love it. :) – lukeA Jan 28 '14 at 21:54
  • It may be more efficient (in `dplyr`) to use `filter` to de-duplicate on `Col1`, i.e. `df %.% arrange(Date) %.% filter(!duplicated(Col1)) %.% group_by(Date) %.% summarise(Count=length(Date)) %.% mutate(Count = cumsum(Count))`. – mnel Jan 28 '14 at 21:57
  • I'm with @mnel - the dplyr code would look a lot better if you consistently used `%.%` – hadley Jan 28 '14 at 22:46
  • @mnel The use of `n()` instead of `length` will be probably better too. cc @hadley – dickoa Jan 28 '14 at 23:03
  • @lukeA, I have another column now named 'Type', which contains 'Type1', 'Type2' and 'Type3'. I want the cumulative sum based on 'Type' as well. – BigDataScientist Feb 11 '14 at 17:52
  • @lukeA, I figured this out, it will be `dt <- data.table(df, key=c("Date",'Type')); dt <- unique(dt, by="Col1"); (dt <- dt[, list(Count=.N), by=c('Date','Type')][, Count:=cumsum(Count), by=c('Type')])` – BigDataScientist Feb 11 '14 at 17:59

With ddply (from the plyr package) and duplicated, you just have to do the following, where `data` is the original data frame (`df` in the question):

library(plyr)
df  <- ddply(data, .(Date, Col1), nrow)                  # one row per (Date, Col1) pair
df2 <- ddply(df[!duplicated(df$Col1), ], .(Date), nrow)  # newly seen Col1 values per date
transform(df2, Count = cumsum(V1))                       # running total

i.e. you first count rows for every (Date, Col1) pair, then drop the Col1 values that already appeared on an earlier date, count the remaining new values per date, and finally take the running total.

Your data must be sorted by Date beforehand.
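
With the sample data above, df2 holds the counts of newly seen Col1 values per date (3, 2, 1), so the final step should give something like:

#         Date V1 Count
# 1 2014-01-01  3     3
# 2 2014-01-02  2     5
# 3 2014-01-03  1     6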

MatthieuBizien